Learning R while exploring statistics
This exercise is designed to help you learn R while at the same time gaining insights into the
phenomenon of illusory correlation. We will go through the following steps:
1. Downloading R and R Studio, an interface to the R programming language that is rather
easier to work with than the basic interface.
2. Familiarisation with basic operations in R
3. Generating simulated data: two correlated variables, X and Y
4. Generating simulated data from two groups with different means on uncorrelated variables
U and V to demonstrate spurious correlation between U and V.
5. Demonstrating how incorporating group identity in a linear model unmasks the spurious
nature of the correlation between U and V.
6. Demonstrating how removing the effect of group will be misleading if group identity is highly
dependent on one of the variables.

These instructions apply to those working on a PC; I don't know whether the equivalent
steps apply on a Mac.
For steps 5 and 6 it's assumed you have a basic understanding of simple regression.

1. Downloading R and R Studio
Downloading R
R is a powerful language for statistical computing, but much of the documentation is written
for experts, and so it can be daunting for beginners. If you go to the website:
http://www.r-project.org/
You will see instructions for how to download R. Do not be put off by the instruction to
"choose your preferred CRAN mirror": this just means you should select a download site from
the list provided that is geographically close to where you are.
You may then be offered further options that you may not fully understand. Just persevere by
selecting the 'windows' option from the "Download and install R" section, and then select
'base', which at last takes you to a page with straightforward download instructions.
Installation of R will create a Start Menu item and an icon for R on your desktop.

Downloading R Studio
To download R Studio, go to this website and follow the instructions.
http://rstudio.org/

If for any reason you prefer not to use R Studio, the examples should all work from the
original R interface, but your screen may look different, and it may be difficult to arrange items
such as figures in a sensible way.

2. Familiarisation with basic operations in R
After opening R Studio your screen will be divided into several windows. Move your cursor to
the window called R console, in which you can type commands.
You will see a > cursor.
This cursor will not be shown in the examples below, but it indicates that the console is
awaiting input from you.
At the > cursor, type:

  help.start()
As with other programming languages, you hit Enter at the end of each command.
This will open a window showing links to various manuals. You may want to briefly explore
this before going further.

Just to familiarise yourself with the console, type:
  1+2

R evaluates the expression and you see output:
 [1] 3




Version 1.1, 9th June 2012
The [1] at the beginning of the output line indicates that the answer is the first row of the
variable. This looks confusing if you just have a single number, as in this case, but, as we will
see, output can consist of an array of numbers.

Now type:
  x = 1+2
Nothing happens. But the variable x has been assigned, and if you now type x on the
console, you will again see the output

 [1] 3

In R, the results of variable assignments are not shown automatically, but you can see them
at any time by just typing the name of the variable.
You can also see all current variables in the Workspace screen on the right.

The value assigned to variable x will remain assigned unless you explicitly remove it using the
'rm' command. Type:
   rm(x)
You now see that x has disappeared from the Workspace screen.
If you type x again, the console gives the message:
  Error: object 'x' not found

You can repeat an earlier command by pressing the up arrow until it reappears. Use this
method to redo the assignment x=1+2, and then type X. Again you get the error message,
because R is case-sensitive, and so X and x are different variables.

Now type:

 y = c(1, 3, 6, 7)

The workspace tells you y is a numeric variable with four values, i.e. a vector.
To see the values, type y on the console. You will see the numbers 1 3 6 7, preceded by the
[1] index. The 'c' in the previous command is not a variable name, but rather denotes the
operation of concatenation. It just instructs R to create a variable consisting of the sequence
of material that follows in brackets.

Now type:
  x=
and hit Enter.
The cursor changes to +
This is R telling you that the command is incomplete. If you now type 1+2 followed by Enter,
your regular cursor returns, because the command is completed.

It can happen that you start typing a command and think better of it. To escape from an
incomplete command, and restore the > cursor, just hit Escape.

The Console is useful for doing quick computations and checking out commands, but in
general, when you do computations, you will want to use a script, i.e. a set of commands that
you can save, so you can repeat the sequence of operations at any time. The script is written
in the Source window (also known as the Editor window).

From the menu at the top of the screen select File|New|R script.
You will see a new tab in the Source window, labelled Untitled1. You want to save it with a
name. Select a name such as Demo1 and type this in the Source window, preceded by the
symbol #.
It is important that the name contains no blank spaces.
If you make a script name with blank spaces, this can create havoc later on, because when
you try to execute it, R will interpret all but the first word as commands, and you will get
misleading error messages that will have you scratching your head as to what they mean.




The hash symbol that you typed before the script name is used to create a comment in a
script, i.e. a line that is used to remind the user of important information, but which is not
executed when the script runs. It is customary to put the title of the script, plus information
about its function, author and date at the head of the script.
Select the menu command File|SaveAs to save the script with that name.

Currently, your script doesn't do anything. Let's give it some content.
In the Source window type:
x=2+3
y=4+5
z=x+y

Now select the top menu item Edit|Run Code|Run All.
As the script executes, you will see the commands in the script repeated in the Console
window, and the values of the variables x, y and z in the Workspace window.
These variables will remain assigned to these values until explicitly cleared.
You can test this by typing a command at the console such as:

x-y

which will give the answer -4.

Important: Traditionally, R scripts use <- instead of =.
So, you will see instances of scripts which have commands such as
          a <- 1+3
This is equivalent to
          a = 1+3
It is also possible to have the arrow going the other way, i.e. 1+3 -> a, which means the
same thing.
My view of life is that you should never make two keystrokes when one will do, and so I
persist with the use of the equals sign, but R purists disapprove of this.
One reason for avoiding = in assigning values to variables, is that it can be confusing,
because the equals symbol is also used in other contexts, such as judging whether two things
are the same. For the present, I'm not going to worry you further about this, but you may want
to squirrel that fact away. Confusion between different uses of the = operator causes much
grief, not just in R but in most programming languages.

Loops: A loop is a way of repeatedly executing the same code. Suppose we wanted to print
out the ten times table, we could type 1*10, 2*10, 3*10, and so on. But a simpler method is to
use a loop, where we multiply 10 times a variable, myx, and specify the range of values that
myx will take at the start of the loop. Thus we can type in the commands:
          for (myx in c(1:10))
                   {
                   print(10*myx)
                   }
The first line specifies the values that myx can take, i.e. c(1:10), which is the values 1, 2, 3, 4,
5, 6, 7, 8, 9, 10. The program executes all the commands between curly brackets repeatedly,
incrementing the value of myx each time it does so, until it gets to the final value, whereupon
it exits the loop.
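As an aside (not part of the loop example above): R's arithmetic is vectorised, so you can often do this kind of repetitive computation without an explicit loop at all, by operating on a whole vector in one step:

```r
# Multiply the whole vector 1:10 by 10 in a single operation
print(10 * (1:10))
```

Loops remain useful for more complex tasks, but vectorised code is the more idiomatic (and usually faster) option in R.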

Stopping a program: Sometimes a program has been written in a way that it keeps running
and never stops. If you need to abort, hit the Escape key (or Ctrl+C if you are running R in a
terminal).

Commenting: A good script will contain many lines preceded by #
This indicates that the line is a comment – it does not contain commands to be executed, but
provides explanation of how the script works.

Before you go any further, create a new directory that will contain all of your scripts, data, and
workspace for a project. Then go to the menu and select Tools|Set Working
Directory|Choose Directory and navigate to your new directory. This means that all your


work will be saved in one place. Whenever you start up R from a file in that directory, it will
continue as your working directory.

A note on quotes: If you paste a script into your R console or browser, quotes may get
reformatted, causing an error. Always check: for R, single quotes should be straight quotes,
not 'smart quotes' (i.e. quotes that slant or curl in a different direction at the start and end of a
quoted section). You may need to retype them if your system has reformatted them.

Further reading
The best way to learn R is to play with it. You should try typing in commands to see what
happens. Use the R Manuals from the Help screen to get started.

In addition, these texts are recommended.
Braun, W. J., & Murdoch, D. J. (2007). A first course in statistical programming with R.
Cambridge: Cambridge University Press.
Crawley, M. J. (2007). The R Book. Chichester, UK: Wiley.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S, 4th edition. New
York: Springer. (Do not be put off by the title: it really should be entitled 'with S and R')

3. Generating simulated data: two correlated variables, X and Y
An important aspect of R is the ease with which you can generate simulated data.
Playing with simulated data is one of the best ways of gaining an intuitive grasp of statistics.
You can create a dataset with certain characteristics, and then see what happens when you
analyse it in different ways.

Most introductions to statistics ignore the potential of simulated data, and simulation is often
seen as an advanced topic. My view is that it should be one of the first things you learn to do.

As a first exercise in running a script in R, we shall generate a simulated set of data for two
variables, look at some basic statistics for the variables, plot them, and save the data. We will
be using the data for more interesting purposes later on, but for the time being, the aim is to
familiarise you with some key R commands. In addition, it is very useful to know how to
simulate datasets with specific characteristics, as these can be used to check how various
analyses work.

Unfortunately, many people find R commands quite daunting, and the command needed to
create simulated data will probably look horrific if you are a newbie. Also, in R the help
commands are often not all that helpful, as they are written for statisticians. Don't lose your
nerve. I shall walk you through it and all will become clear.

One of the first things you need to understand about R is that there is a huge number of
functions that you can use to carry out various statistical, mathematical and graphic
operations, but they aren't all available when you start up R. Many of them are available in
'packages' which you have to specify if you want to use them. There's a nice explanation of
how you can find and use packages here:
http://ww2.coastal.edu/kingw/statistics/R-tutorials/package.html

We're going to use commands from a package called MASS, which contains functions and
datasets from 'Modern Applied Statistics with S' by Venables and Ripley (see Recommended
Reading above).

All we need to do is to include the following line in our script:
require(MASS)

Once that command is executed, all the functions from MASS will be available for us to use.
When learning R, it's a good idea to run each new command and see what, if anything
happens, and whether the workspace changes. If you just highlight one or more commands in
the Editor window and then hit the Run button with a green arrow at the top of the window,
this just runs that command. If you run the 'require' command as above, the Console just
reassuringly tells you it is loading MASS.


Now, we're going to generate two columns of correlated numbers, X and Y.
We'll start by creating a variable to hold their names. The next line of your script should be:
mylabels=c('X','Y')            # Put labels for the two variables in a vector

Remember: You could just omit the bit after the hash, which is a comment. It's there to
remind you what you are doing. It may be obvious now, but, trust me, it won't be if you come
back a week later. You should add your own comments, using language that will be helpful to
you.

If you run this command you will see that the Workspace shows mylabels as a character
variable with two values. It knows to treat mylabels as a character, rather than number
variable because you have enclosed the labels in quotes. Does it matter if you use single or
double quotes? I couldn't remember, so I just tried making a different variable by typing a
command on the console with double quotes - you should do the same. It's always good
practice to just play around with commands and see what happens.

We are going to use a fancy command from MASS called mvrnorm. It's not uncommon to
forget the precise format that a command needs, but help is at hand.
On the console type help(mvrnorm), and you will find that the Help screen shows you the way
the command is used. It first tells you what the arguments are for the command, i.e. the things
you need to specify to make it work, it then terrifies you with a more technical explanation,
and finally gives a worked example. The worked example may be helpful or may just baffle
you completely.

Let's look at mvrnorm. The help screen starts as follows:
mvrnorm(n = 1, mu, Sigma, tol = 1e-6, empirical = FALSE)
and then gives an account of what each argument is.

n         the number of samples required.
mu        a vector giving the means of the variables.
Sigma     a positive-definite symmetric matrix specifying the covariance matrix of the
          variables.
tol       tolerance (relative to largest variance) for numerical lack of positive-definiteness
          in Sigma.
empirical logical. If true, mu and Sigma specify the empirical not population mean and
          covariance matrix.
Thus the first things you need to specify are the number of cases to simulate (n), the means
of the variables (mu), and the covariance matrix (Sigma).

We are going to be working with z-scores, to make life easier. Remember that for z-scores, a
correlation is equivalent to a covariance, and the SD and variance are both equal to 1.

We first specify the correlation that we want:
myr = .5
Add that to your script, and run it, so that we have a value in myr.
For Sigma, we need to specify the following matrix:
[ 1    myr
  myr  1  ]

In R, you can create a matrix using the c (concatenate) command, but if you just typed c(1,
myr, myr, 1), then this wouldn't work. Why not? Try typing this at the console and see.
You'll find you have the right numbers, but they aren't in a 2x2 matrix. To get them properly
arranged, you need to explicitly specify that you want a matrix with two rows and two
columns.

So the full command is
mysigma = matrix(c(1,myr,myr,1),2,2)




The last two numbers in the command indicate we want 2 rows and 2 columns. Look at
mysigma. You could then try making another matrix, but with 1, 4 rather than 2, 2 at the end. I
can't stress enough that to understand commands, you just have to try them out. If you aren't
sure how something works, tweak a command and see what happens.
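If you would like a worked version of that experiment (the matrix function is base R; the values are just the ones used above), try these at the console:

```r
myr <- .5
matrix(c(1, myr, myr, 1), 2, 2)  # a 2x2 matrix: the covariance matrix we want
matrix(c(1, myr, myr, 1), 1, 4)  # the same four numbers, but as a single row
```

Note that matrix() fills its values column by column by default; here it makes no difference, because our covariance matrix is symmetric.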

Note - there's nothing to stop you typing .5 rather than myr in the command above. It will give
the same answer. But we want a flexible script that will allow us to play around and look at
different values of the correlation, and if we use the variable myr in the code, rather than a
specific value, this allows us to do that easily.

So all you now need is to specify the number of cases and the mean values for X and Y.
We do this with the commands:

myn=50       # we're going to create 50 rows of data
mymean=c(0,0) #means are zero for both X and Y

We are now ready to go! What about the other arguments, tol and empirical? They are
optional and we'll leave them alone for the moment, though we will look at empirical later on.

We need a variable name for our simulated data. Let's call it myarray. So we type:
myarray=mvrnorm(n=myn, mymean,mysigma)

Now run the whole script. Each command is reflected in the console as it executes. But where
are the results? The workspace now confirms that you have created myarray which is a
matrix with 50 rows and 2 columns. To look at the results, just type myarray on the console.
There are your 50 paired z-scores!

Before going further, I'll just explain why I've created variables that all start with 'my'. This is
not essential, but it's a fairly common method. It has the advantage that you are unlikely to
inadvertently use a variable name that corresponds to an existing R command, and when
reading a script it makes it generally easier to distinguish your variables from other parts of R
language.

We have created paired variables, but they aren't yet labelled. Assigning column names to a
matrix in R is easy. Remember, we created mylabels earlier. We can assign these as our
column names as follows:
colnames(myarray)=mylabels

So now you have built up a whole script to generate paired numbers, which looks like this:

# simulate_XY
# Script to simulate z-scores X and Y, with specific correlation

require(MASS) # Load functions from Modern Applied Statistics for S
mylabels=c('X','Y') # Labels to be used later for our variables
myr=.5              # Correlation (can be changed)
mysigma = matrix(c(1,myr, myr,1),2,2) # 2 x 2 covariance matrix
                                        # (with zscores, equiv to correlation matrix)
myn=50             # N rows of data to simulate
mymean=c(0,0) # Means for each variable (zero for zscores)
myarray=mvrnorm(n=myn, mymean,mysigma) # create array of simulated data
colnames(myarray)=mylabels #Assign labels to columns of simulated data

But you may be suspicious. How do you know that the numbers you have generated have
mean of zero and are actually correlated .5?
You can use R commands to find out.

This command gives you a range of descriptive statistics, including the means:
summary(myarray)




and this one gives the correlation matrix:
cor(myarray)

At this point, you may start to think (depending on your locus of control) either that you have
done something wrong, or that R is not very good. It's highly likely that your means will differ
from zero, and the correlation will be smaller or bigger than .5. The reason is that we did not
specify empirical = TRUE. R has faithfully generated a sample of observations from a
population of values where the true correlation is .5, but because of sampling error, the
observed value in this sample is likely to deviate from .5.

If you re-run the program, but this time alter the mvrnorm command to:
myarray=mvrnorm(n=myn, mymean,mysigma,empirical=TRUE)
then you will find the means are zero (or, more likely, a number infinitesimally close to zero)
and the correlation is .5.

Alternatively, you could remove the empirical command (or specify empirical=FALSE, which
has the same effect), but specify n = 50000, or another very large number. The larger the
sample you take from the population, the closer the sample correlation will approach to the
population correlation.

It's always a good idea to plot data as well as looking at summary statistics. To see a
scatterplot of your data, add this command to your script:
plot(myarray)
A graph will now pop up in the Plots tab of the right hand lower window.

Finally, you might want to save your simulated data so you can use them at a later time.
This command will write a data file to your current directory:
write.table(myarray,"mysimdata")
If you want to get your data back on another occasion, this command will read the saved data
into a matrix called newdata
newdata=read.table("mysimdata")
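One wrinkle worth knowing: read.table returns a data frame rather than a matrix, so if you want to carry on doing matrix operations with the reloaded data, convert it back with as.matrix:

```r
newdata <- read.table("mysimdata")
class(newdata)               # a data frame, not a matrix
newdata <- as.matrix(newdata)  # convert back for matrix operations
```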

The mvrnorm command uses a random number generator, which means that each time you
run the script, different numbers will be generated. If you want to always get the same
numbers, you can do so by just specifying a 'seed' for the random number generator. This
can be any number, but provided it is the same number each time, you'll get the same result.
Just put this command somewhere before the mvrnorm command:
set.seed(2)
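You can see the effect of set.seed for yourself at the console: setting the same seed before two runs makes the random numbers come out identical. (rnorm, which generates random normal deviates, is used here just as a quick illustration.)

```r
set.seed(2)
a <- rnorm(5)    # five random normal deviates
set.seed(2)
b <- rnorm(5)    # same seed, so the same five numbers
identical(a, b)  # TRUE
```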

If you have started from scratch and got this far, then you should take a break and reward
yourself with a cup of coffee or whatever other substances hit the spot for you.

4. Generating simulated data from two groups with different means on uncorrelated
variables U and V
We're now going to apply what we've learned to generate data from two separate groups on
two variables that are uncorrelated. The only difference is that the means differ on both
variables for the two groups. Let's set the means for U and V for group A as -1, and for group
B they'll be 1. We'll generate 60 cases for each group. We'll call these datasets myarrayA and
myarrayB.

If you've followed what we've done so far, you should be able to work out how to do this. It will
be a good exercise to try, as you learn R by thinking it through, rather than by just copying.
But I'll give you a script to do it anyway, in case you get stuck:

#demo_spurious_corr_script

require(MASS) #Load functions from Modern Applied Statistics for S
mylabels=c('U','V')
myr=0 #U and V are uncorrelated, and so r is set to zero
mysigma = matrix(c(1,myr, myr,1),2,2)


myn=60
set.seed(3)
#Array for group A
mymean=c(-1,-1) #mean zscore for group A
myarrayA=mvrnorm(n=myn, mymean,mysigma) #Generate uncorrelated U and V for grp A
colnames(myarrayA)=mylabels
summary(myarrayA)
cor(myarrayA)
plot(myarrayA)

#Array for group B
mymean=c(1,1) #mean zscore for group B
myarrayB=mvrnorm(n=myn, mymean,mysigma)#Generate uncorrelated U and V for grp B
colnames(myarrayB)=mylabels
summary(myarrayB)
cor(myarrayB)
plot(myarrayB)

We now want to combine the two arrays into one long column, and call this combined array
with a new name, myarrayAB.

This can be achieved with a single command for concatenating rows, as follows:
myarrayAB=rbind(myarrayA,myarrayB)

We can then look at the correlation for the combined groups:
cor(myarrayAB)

Even though the correlation within either group was set to zero, the correlation for the
combined groups is around .5 and highly significant. This is the phenomenon of spurious
correlation.
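To check the significance for yourself, you can use cor.test, a base-R function that is not part of the script above, on the two columns of the combined array:

```r
# Test the U-V correlation in the combined sample
cor.test(myarrayAB[, "U"], myarrayAB[, "V"])
```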

To make it more concrete, consider if U and V were height and chest hairiness and groups A
and B were males and females. Since men tend to be taller and hairier than women, you
could find a spurious correlation between height and hairiness in a combined group, even
though they are uncorrelated within either sex.

One reason I like simulations is that they can give you new insights into such phenomena.
Note that we specified massive mean differences between our groups: one group with a
mean z-score of +1 and the other with mean z-score of -1. When I first attempted this
simulation, I used much smaller group differences, and was surprised at how hard it was to
generate a spurious correlation. With a simulation like this, you can play around and get a
good feel for the phenomenon by repeatedly generating datasets with different values. The
phenomenon of spurious correlation is a source of major concern, especially for those
interested in correlational data, but my impression is that its importance may have been
overemphasised, because in practice it doesn't become a problem except in quite extreme
situations where you have two groups with very different mean values.
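You can explore this systematically. The following sketch is my own addition, with illustrative values: it loops over a range of group separations and prints the combined-sample correlation each time, so you can see that the spurious correlation only becomes substantial when the group means are far apart.

```r
require(MASS)
set.seed(1)
mysigma0 <- diag(2)  # identity matrix: uncorrelated variables with unit variance
for (d in c(0, 0.25, 0.5, 1, 2)) {
  grpA <- mvrnorm(n = 1000, c(-d, -d), mysigma0)  # group A centred at -d
  grpB <- mvrnorm(n = 1000, c(d, d), mysigma0)    # group B centred at +d
  combined <- rbind(grpA, grpB)
  cat("separation:", 2 * d, " combined r:", round(cor(combined)[1, 2], 2), "\n")
}
```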

5. Demonstrating how incorporating group identity in a linear model unmasks the
spurious nature of the correlation between U and V

Let us stick with the interpretation of our simulated data as representing height and hairiness
in males and females (ignoring the fact that the group mean differences are vastly greater
than would be realistic). We now need to add to our combined dataset another column that
specifies gender.

The R command rep will just create a vector of repeated numbers. We make a set of 60
values = .5 for males, and 60 values = -.5 for females. The reason for picking these specific
values is because it helps interpretation of regression output if we set the average for two
groups to zero and make the mean difference between them equal to one. However, it's not




essential to do this, and you could have picked other numbers, such as 0 and 1 to indicate
group identity.

males=rep(.5,myn) #Create vector with myn repetitions of value .5
females=rep(-.5,myn) #Create vector with myn repetitions of value -.5

Having made our two sets of numbers, we then join them together in a variable called
gender as follows:
gender=c(males, females)

Run these commands and then type gender at the console to check the result.

All that is now needed is to bolt this column on to our existing myarrayAB, which we can do
with a single command for concatenating columns, cbind.
myarrayAB=cbind(gender,myarrayAB)

Note that I have created a lot of intermediate variables in the course of generating
myarrayAB. This is unnecessary and uses up memory. It would be possible to combine
several steps in one command and so avoid creating the intermediate variables. However,
when learning R, I think it is helpful to break commands down into small steps and create new
variables, as this allows you to see the logic of what is being done, and to check the values of
each variable. It also makes your scripts easier to understand when you come back to them
later. Very experienced programmers may write much more compact code than this, but with
modern computers, memory is seldom a problem unless working with very large data arrays,
and so, apart from demonstrating how clever you are, compact code doesn't serve much
function.

We now want to do a regression analysis. We will start with simple regression of V on U for
the combined group data.
R has many powerful commands for doing regression, but it requires that the data are
formatted in what is called a data frame.
Fortunately, this transformation is trivially easy: we just add the command:
mydata=data.frame(myarrayAB)
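The str command (base R) gives a quick way of confirming that the conversion worked as intended:

```r
mydata <- data.frame(myarrayAB)
str(mydata)  # gender, U and V should now appear as numeric columns
```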

Commands for regression in R are formulated in terms of the general linear model. This is a
very general and flexible approach to statistical analysis that readily incorporates the more
traditional methods beloved of psychologists such as analysis of variance. However, I suspect
that many psychologists reading this won't find it a very intuitive way to think about data, and
it takes a while to map the R commands onto pre-existing statistical knowledge.

The other thing that can be puzzling is that with programs such as SPSS, we are used to
running a command and then looking at the output screen. Although R can be used in an
analogous way, it is more usual to write the results to another variable. The variable that
holds the results is likely to be a fairly complex structure, as we shall see. But the basic idea
is that you don't just use a command to do the analysis: you actually specify a name for the
output of the analysis.

The simplest form of regression is pretty easy. The command lm just stands for linear model,
and requires two obligatory commands: you have to specify a formula that indicates the
relationship between predicted and predictor variables, and specify the dataset used to
estimate regression coefficients. So let's illustrate this with our U and V variables.
Add this command to the script:

myreg1=lm(V~U,mydata)

and then inspect the myreg1 variable that is created.
This contains two coefficients, an intercept, that is close to zero, and a slope, that is close to
0.5.
Note that when you type myreg1 you also get information about the formula used to generate
the coefficients, labelled call. The output of lm contains a complex set of varied information in a


structure. If you want to look at just part of the structure, you have to use the $ sign to indicate
which bit. Try this, by just typing at the console:
myreg1$call
and
myreg1$coef

You will see that the portion after the $ indicates which bit of the myreg1 structure is referred
to.

The term V~U tells the program to fit a straight line according to the formula:
V = b1 + b2.U
where b1 is the intercept and b2 is the slope.

It is these slopes and intercepts that are then generated when the lm command is executed.

We can use these outputs to plot the regression line.
First plot the raw data. This command will achieve that:
plot(V~U, mydata)

The command abline plots a straight line with a given intercept and slope. You could add a
straight line with intercept zero and slope 1, as follows:
abline(0,1)

The regression line is simply the straight line with intercept and slope corresponding to the
computed regression coefficients, and so can be plotted just by typing:
abline(myreg1$coef)

The lty command allows you to specify the type of line you want. This command will make the
regression line a dotted line.
abline(myreg1$coef,lty=5)

As an aside here, I haven't used R very much, and when I first saw a command with lty I was
confused and thought it was some kind of variable. This is, in my experience, a common
difficulty with R. Various letter sequences that look like variables or functions, aren't. What did
I do? I Googled "R lty" and immediately all became clear. Perhaps the single most important
advice if you want to learn R is to just use Google if you get stuck.

We now want to look at the regression with gender included. A simple modification to the
syntax achieves this. We have taken care to code gender so that the sum of the two gender
codes is zero, and we can include it in the linear model, even though it is a categorical
variable.
Here is the command:
myreg2=lm(V~U+gender,mydata)

This corresponds to the regression equation:
V = b1 + b2.U + b3.gender

If we type myreg2, we see that the output now has one intercept and two regression
coefficients, like this:
(Intercept)             U      gender
0.03357             0.04529   1.68913

Your values may differ from this because the simulated data will be different, but the overall
pattern will be similar. Note that the regression coefficient associated with U is now close to
zero, whereas that associated with gender is much bigger.
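As an aside, you don't have to read the coefficients off the screen: coef and confint will extract them within a script. Here is a minimal stand-alone sketch, using simulated stand-in data of the same form as the exercise (not the actual mydata, and myfit is just a name made up for this example):

```r
# Stand-in data of the same form as the gender regression above
set.seed(1)
U = rnorm(60)
gender = rep(c(-1,1), 30)             # balanced gender codes summing to zero
V = 0.05*U + 1.7*gender + rnorm(60)   # V driven mainly by gender, not U
myfit = lm(V ~ U + gender)
b = coef(myfit)   # named vector: (Intercept), U, gender
b["gender"]       # pick out a single coefficient by name
confint(myfit)    # 95% confidence intervals for all coefficients
```

Extracting coefficients by name rather than by position makes a script robust if you later reorder the predictors.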

Once we have run the model we can get much more detailed statistical output by requesting a
summary, as follows:
summary(myreg2)




Version 1.1, 9th June 2012
Now we have not only the coefficients, but their standard errors, associated t-values and
significance levels. This confirms that gender is a substantial predictor of V, and U is not.
Finally, you can use the anova command to produce an anova table comparing the fit of the
two models:
anova(myreg1,myreg2)

I've learned a lot about using R for regression analysis from this site. It also has information
on how to do diagnostic plots, for instance. However, for the present, I won't get diverted into
that, but will rather press on to look at what happens if group membership is defined on a
variable that is highly correlated with one of the measures in the analysis.
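For the curious, though, R's built-in diagnostic plots need only one command on a fitted lm object. A stand-alone sketch with made-up data (x, y and myfit are names invented for this example):

```r
# Hypothetical mini-example: diagnostic plots for any fitted lm object
set.seed(1)
x = rnorm(40)
y = 2*x + rnorm(40)
myfit = lm(y ~ x)
par(mfrow=c(2,2))  # arrange the four plots in a 2 x 2 grid
plot(myfit)        # residuals, normal Q-Q, scale-location, leverage
par(mfrow=c(1,1))  # restore the default single-plot layout
```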


6. Demonstrating how removing the effect of group will be misleading if group identity
is highly dependent on one of the variables.

You should by now be able to follow this script, which is heavily commented to explain each
step. This time we are going to generate a multivariate normal distribution with three
variables. Two of them, L1 and L2, are language measures, and A is an auditory measure. The
language measures show moderate correlation with the auditory measure and are highly
intercorrelated with one another. Group identity (control or language impaired, coded 1 or -1)
is defined by whether or not the score on L1 is above a z-score of -1. This, then, is analogous
to the case of dyslexia or language impairment, where we define whether or not the child has
the diagnosis on the basis of a low test score.

In a case like this, removing the effect of group can abolish the relationship between L2 and
A, simply because L1 and L2 are highly intercorrelated. It would be quite wrong to conclude
from this that L2 and A are not related.

#demo_spurious_corr_script3
# Using a group variable that is highly correlated with one variable

# With these settings, including the SLI group variable in the model
# removes the apparent influence of L2
require(MASS) #Load functions from Modern Applied Statistics for S
mylabels=c('L1','L2','A') #3 variables, two language and one auditory
myr=.8 #correlation between the language measures
myr2=.3 #correlation of both language measures with auditory
mysigma = matrix(c(1,myr,myr2,
            myr,1,myr2,
            myr2,myr2,1),3,3)
myn=60
set.seed(6) #change or comment out this line to get different set of estimates

mymean=c(0,0,0) #Means for L1, L2, and A are zero
myarray3=mvrnorm(n=myn, mymean,mysigma,empirical=TRUE)
colnames(myarray3)=mylabels
summary(myarray3)
cor(myarray3)
myL1=myarray3[,1] #first column

# Now determine which cases are control or SLI and put in mygroup variable
mygroup=rep.int(-1,myn) #default is SLI, coded -1
mycon=which(myL1> -1) #row index of those with L1 in con range
mygroup[mycon]=1 # These rows are assigned group code of 1 (control)

myarrayAB=cbind(mygroup,myarray3) #add mygroup to the data array
mydata=data.frame(myarrayAB)

#Regression with only Group included
myreg1=lm(A~mygroup,mydata)


summary(myreg1)

#Regression with both group and L2 included
myreg2=lm(A~L2+mygroup,mydata)
summary(myreg2)

anova(myreg1,myreg2)

#Regression if we exclude group ID
myreg3=lm(A~L2,mydata)
summary(myreg3)
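As a quick check on what the simulation produced, the following stand-alone sketch regenerates the same data and confirms two things: because we set empirical=TRUE, the L2-A correlation is exactly the .3 we specified, and group identity is strongly tied to L1 (m and grp are names made up for this sketch):

```r
# Stand-alone check: regenerate the script's data and inspect correlations
require(MASS)
mysigma = matrix(c(1,.8,.3,
                   .8,1,.3,
                   .3,.3,1),3,3)
set.seed(6)
m = mvrnorm(n=60, mu=c(0,0,0), Sigma=mysigma, empirical=TRUE)
cor(m[,2], m[,3])               # L2 with A: exactly .3, by construction
grp = ifelse(m[,1] > -1, 1, -1) # group defined by the cutoff on L1
cor(m[,1], grp)                 # group identity is strongly tied to L1
```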

The point I want to make with this simulation is that if we want to 'take out' the effect of group
identity from a correlation, then we need to think carefully about the logic of what we are
doing.

In the previous example of spurious correlation, we defined gender quite independently of our
two measures, height and hairiness. Although males and females differed substantially on
both measures, their gender was not determined by those measures. In any logical causal
route, we can confidently treat gender as a primary cause, and so it makes sense to 'take out'
its effect.

For certain developmental disorders (and indeed other conditions), the causal route is much
less certain, because the disorder is diagnosed on the basis of measured variables. So, for
instance, dyslexia is defined in terms of low scores on reading measures. In the simulation
above, we looked at correlation between L2 and A, and defined our disorder in terms of L1 -
which was highly correlated with L2. We could have defined dyslexia in terms of L2 - you
might like to try that: it will achieve a similar effect. The results we got from our simulation are
actually sensible, but there is a danger they will be misinterpreted. What they are actually
telling us is that language measures and auditory measures are significantly correlated, and
this is evident regardless of whether we use a categorical language measure, where group
identity is determined by cutoff on a test, or a quantitative measure. What this analysis is
definitely not saying is that the correlation between language and auditory measures is
spurious.
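The variation suggested above can be sketched as follows. This stand-alone version regenerates the same data and then applies the cutoff to L2 rather than L1 (mygroupB and mydataB are names invented for the variant):

```r
# Variation to try: define the disorder by a cutoff on L2 instead of L1
require(MASS)
mysigma = matrix(c(1,.8,.3,
                   .8,1,.3,
                   .3,.3,1),3,3)
set.seed(6)
myarray3 = mvrnorm(n=60, mu=c(0,0,0), Sigma=mysigma, empirical=TRUE)
colnames(myarray3) = c('L1','L2','A')
mygroupB = rep.int(-1, 60)             # default is impaired, coded -1
mygroupB[which(myarray3[,2] > -1)] = 1 # controls: L2 above z-score of -1
mydataB = data.frame(cbind(mygroupB, myarray3))
summary(lm(A~L2+mygroupB, mydataB))  # compare the L2 coefficient with myreg2
```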

It's possible to imagine a situation where you could have a spurious association with these
kinds of variables. For instance, poor social environment may affect both language measures
and auditory measures. To show that, we'd need to incorporate a measure of social
environment in our regression analysis. But the bottom line is that if we want to argue that an
association between variables X and Y is spurious, we must have a third variable, Z, that is
(a) measurable and (b) not dependent on X or Y. Z may be highly correlated with X and Y:
that's not a problem. The problem is when Z is determined by X or Y.
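To make that concrete, here is a hypothetical sketch in which a made-up third variable Z (standing for social environment) drives both a language measure and an auditory measure; the variable names and effect sizes are invented for illustration:

```r
# Hypothetical sketch: a genuine third variable Z (social environment)
# that drives both a language measure L and an auditory measure A
set.seed(1)
n = 60
Z = rnorm(n)                  # social environment, measured independently
L = 0.6*Z + rnorm(n, sd=0.8)  # language score partly driven by Z
A = 0.6*Z + rnorm(n, sd=0.8)  # auditory score partly driven by Z
summary(lm(A ~ L))            # L appears to predict A via their shared cause
summary(lm(A ~ L + Z))        # with Z included, the L coefficient typically shrinks
```

The key design point is that Z is generated independently of L and A's error terms, so it is legitimate to partial it out; this is exactly what fails when the "third variable" is itself defined by a cutoff on one of the measures.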





More Related Content

What's hot

MATLAB programming tips 2 - Input and Output Commands
MATLAB programming tips 2 - Input and Output CommandsMATLAB programming tips 2 - Input and Output Commands
MATLAB programming tips 2 - Input and Output CommandsShameer Ahmed Koya
 
[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++
[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++
[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++Muhammad Hammad Waseem
 
Error correction-and-type-of-error-in-c
Error correction-and-type-of-error-in-cError correction-and-type-of-error-in-c
Error correction-and-type-of-error-in-cMd Nazmul Hossain Mir
 
Chowtodoprogram solutions
Chowtodoprogram solutionsChowtodoprogram solutions
Chowtodoprogram solutionsMusa Gürbüz
 
Visual basic asp.net programming introduction
Visual basic asp.net programming introductionVisual basic asp.net programming introduction
Visual basic asp.net programming introductionHock Leng PUAH
 
Types of Statements in Python Programming Language
Types of Statements in Python Programming LanguageTypes of Statements in Python Programming Language
Types of Statements in Python Programming LanguageExplore Skilled
 
Basic c# cheat sheet
Basic c# cheat sheetBasic c# cheat sheet
Basic c# cheat sheetAhmed Elshal
 

What's hot (18)

MATLAB programming tips 2 - Input and Output Commands
MATLAB programming tips 2 - Input and Output CommandsMATLAB programming tips 2 - Input and Output Commands
MATLAB programming tips 2 - Input and Output Commands
 
Python Control structures
Python Control structuresPython Control structures
Python Control structures
 
[ITP - Lecture 12] Functions in C/C++
[ITP - Lecture 12] Functions in C/C++[ITP - Lecture 12] Functions in C/C++
[ITP - Lecture 12] Functions in C/C++
 
[ITP - Lecture 07] Comments in C/C++
[ITP - Lecture 07] Comments in C/C++[ITP - Lecture 07] Comments in C/C++
[ITP - Lecture 07] Comments in C/C++
 
A01
A01A01
A01
 
[ITP - Lecture 11] Loops in C/C++
[ITP - Lecture 11] Loops in C/C++[ITP - Lecture 11] Loops in C/C++
[ITP - Lecture 11] Loops in C/C++
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
 
[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++
[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++
[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++
 
Error correction-and-type-of-error-in-c
Error correction-and-type-of-error-in-cError correction-and-type-of-error-in-c
Error correction-and-type-of-error-in-c
 
Chowtodoprogram solutions
Chowtodoprogram solutionsChowtodoprogram solutions
Chowtodoprogram solutions
 
[ITP - Lecture 15] Arrays & its Types
[ITP - Lecture 15] Arrays & its Types[ITP - Lecture 15] Arrays & its Types
[ITP - Lecture 15] Arrays & its Types
 
Visual basic asp.net programming introduction
Visual basic asp.net programming introductionVisual basic asp.net programming introduction
Visual basic asp.net programming introduction
 
Types of Statements in Python Programming Language
Types of Statements in Python Programming LanguageTypes of Statements in Python Programming Language
Types of Statements in Python Programming Language
 
[ITP - Lecture 14] Recursion
[ITP - Lecture 14] Recursion[ITP - Lecture 14] Recursion
[ITP - Lecture 14] Recursion
 
C++ lecture 01
C++   lecture 01C++   lecture 01
C++ lecture 01
 
Manual pseint
Manual pseintManual pseint
Manual pseint
 
Basic c# cheat sheet
Basic c# cheat sheetBasic c# cheat sheet
Basic c# cheat sheet
 
Cp module 2
Cp module 2Cp module 2
Cp module 2
 

Similar to Learning R while exploring illusory correlations

Similar to Learning R while exploring illusory correlations (20)

R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
 
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
 
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
 
Tutorial basic of c ++lesson 1 eng ver
Tutorial basic of c ++lesson 1 eng verTutorial basic of c ++lesson 1 eng ver
Tutorial basic of c ++lesson 1 eng ver
 
C programming perso notes
C programming perso notesC programming perso notes
C programming perso notes
 
Logo tutorial
Logo tutorialLogo tutorial
Logo tutorial
 
5. R basics
5. R basics5. R basics
5. R basics
 
Lecture1
Lecture1Lecture1
Lecture1
 
FULL R PROGRAMMING METERIAL_2.pdf
FULL R PROGRAMMING METERIAL_2.pdfFULL R PROGRAMMING METERIAL_2.pdf
FULL R PROGRAMMING METERIAL_2.pdf
 
Introduction to Programming and QBasic Tutorial
Introduction to Programming and QBasic TutorialIntroduction to Programming and QBasic Tutorial
Introduction to Programming and QBasic Tutorial
 
Lab4 scripts
Lab4 scriptsLab4 scripts
Lab4 scripts
 
JavaScript: Core Part
JavaScript: Core PartJavaScript: Core Part
JavaScript: Core Part
 
Notes1
Notes1Notes1
Notes1
 
Programming For As Comp
Programming For As CompProgramming For As Comp
Programming For As Comp
 
Programming For As Comp
Programming For As CompProgramming For As Comp
Programming For As Comp
 
Questions4
Questions4Questions4
Questions4
 
Getting started with R
Getting started with RGetting started with R
Getting started with R
 
3.5
3.53.5
3.5
 
Algorithm and c language
Algorithm and c languageAlgorithm and c language
Algorithm and c language
 
Loops_in_Rv1.2b
Loops_in_Rv1.2bLoops_in_Rv1.2b
Loops_in_Rv1.2b
 

More from Dorothy Bishop

Exercise/fish oil intervention for dyslexia
Exercise/fish oil intervention for dyslexiaExercise/fish oil intervention for dyslexia
Exercise/fish oil intervention for dyslexiaDorothy Bishop
 
Open Research Practices in the Age of a Papermill Pandemic
Open Research Practices in the Age of a Papermill PandemicOpen Research Practices in the Age of a Papermill Pandemic
Open Research Practices in the Age of a Papermill PandemicDorothy Bishop
 
Language-impaired preschoolers: A follow-up into adolescence.
Language-impaired preschoolers: A follow-up into adolescence.Language-impaired preschoolers: A follow-up into adolescence.
Language-impaired preschoolers: A follow-up into adolescence.Dorothy Bishop
 
Journal club summary: Open Science save lives
Journal club summary: Open Science save livesJournal club summary: Open Science save lives
Journal club summary: Open Science save livesDorothy Bishop
 
Short talk on 2 cognitive biases and reproducibility
Short talk on 2 cognitive biases and reproducibilityShort talk on 2 cognitive biases and reproducibility
Short talk on 2 cognitive biases and reproducibilityDorothy Bishop
 
Otitis media with effusion: an illustration of ascertainment bias
Otitis media with effusion: an illustration of ascertainment biasOtitis media with effusion: an illustration of ascertainment bias
Otitis media with effusion: an illustration of ascertainment biasDorothy Bishop
 
Insights from psychology on lack of reproducibility
Insights from psychology on lack of reproducibilityInsights from psychology on lack of reproducibility
Insights from psychology on lack of reproducibilityDorothy Bishop
 
What are metrics good for? Reflections on REF and TEF
What are metrics good for? Reflections on REF and TEFWhat are metrics good for? Reflections on REF and TEF
What are metrics good for? Reflections on REF and TEFDorothy Bishop
 
Biomarkers for psychological phenotypes?
Biomarkers for psychological phenotypes?Biomarkers for psychological phenotypes?
Biomarkers for psychological phenotypes?Dorothy Bishop
 
Data simulation basics
Data simulation basicsData simulation basics
Data simulation basicsDorothy Bishop
 
Simulating data to gain insights into power and p-hacking
Simulating data to gain insights intopower and p-hackingSimulating data to gain insights intopower and p-hacking
Simulating data to gain insights into power and p-hackingDorothy Bishop
 
Talk on reproducibility in EEG research
Talk on reproducibility in EEG researchTalk on reproducibility in EEG research
Talk on reproducibility in EEG researchDorothy Bishop
 
What is Developmental Language Disorder
What is Developmental Language DisorderWhat is Developmental Language Disorder
What is Developmental Language DisorderDorothy Bishop
 
Developmental language disorder and auditory processing disorder: 
Same or di...
Developmental language disorder and auditory processing disorder: 
Same or di...Developmental language disorder and auditory processing disorder: 
Same or di...
Developmental language disorder and auditory processing disorder: 
Same or di...Dorothy Bishop
 
Fallibility in science: Responsible ways to handle mistakes
Fallibility in science: Responsible ways to handle mistakesFallibility in science: Responsible ways to handle mistakes
Fallibility in science: Responsible ways to handle mistakesDorothy Bishop
 
Improve your study with pre-registration
Improve your study with pre-registrationImprove your study with pre-registration
Improve your study with pre-registrationDorothy Bishop
 
Introduction to simulating data to improve your research
Introduction to simulating data to improve your researchIntroduction to simulating data to improve your research
Introduction to simulating data to improve your researchDorothy Bishop
 
Southampton: lecture on TEF
Southampton: lecture on TEFSouthampton: lecture on TEF
Southampton: lecture on TEFDorothy Bishop
 
Reading list: What’s wrong with our universities
Reading list: What’s wrong with our universitiesReading list: What’s wrong with our universities
Reading list: What’s wrong with our universitiesDorothy Bishop
 
IJLCD Winter Lecture 2016-7 : References
IJLCD Winter Lecture 2016-7 : ReferencesIJLCD Winter Lecture 2016-7 : References
IJLCD Winter Lecture 2016-7 : ReferencesDorothy Bishop
 

More from Dorothy Bishop (20)

Exercise/fish oil intervention for dyslexia
Exercise/fish oil intervention for dyslexiaExercise/fish oil intervention for dyslexia
Exercise/fish oil intervention for dyslexia
 
Open Research Practices in the Age of a Papermill Pandemic
Open Research Practices in the Age of a Papermill PandemicOpen Research Practices in the Age of a Papermill Pandemic
Open Research Practices in the Age of a Papermill Pandemic
 
Language-impaired preschoolers: A follow-up into adolescence.
Language-impaired preschoolers: A follow-up into adolescence.Language-impaired preschoolers: A follow-up into adolescence.
Language-impaired preschoolers: A follow-up into adolescence.
 
Journal club summary: Open Science save lives
Journal club summary: Open Science save livesJournal club summary: Open Science save lives
Journal club summary: Open Science save lives
 
Short talk on 2 cognitive biases and reproducibility
Short talk on 2 cognitive biases and reproducibilityShort talk on 2 cognitive biases and reproducibility
Short talk on 2 cognitive biases and reproducibility
 
Otitis media with effusion: an illustration of ascertainment bias
Otitis media with effusion: an illustration of ascertainment biasOtitis media with effusion: an illustration of ascertainment bias
Otitis media with effusion: an illustration of ascertainment bias
 
Insights from psychology on lack of reproducibility
Insights from psychology on lack of reproducibilityInsights from psychology on lack of reproducibility
Insights from psychology on lack of reproducibility
 
What are metrics good for? Reflections on REF and TEF
What are metrics good for? Reflections on REF and TEFWhat are metrics good for? Reflections on REF and TEF
What are metrics good for? Reflections on REF and TEF
 
Biomarkers for psychological phenotypes?
Biomarkers for psychological phenotypes?Biomarkers for psychological phenotypes?
Biomarkers for psychological phenotypes?
 
Data simulation basics
Data simulation basicsData simulation basics
Data simulation basics
 
Simulating data to gain insights into power and p-hacking
Simulating data to gain insights intopower and p-hackingSimulating data to gain insights intopower and p-hacking
Simulating data to gain insights into power and p-hacking
 
Talk on reproducibility in EEG research
Talk on reproducibility in EEG researchTalk on reproducibility in EEG research
Talk on reproducibility in EEG research
 
What is Developmental Language Disorder
What is Developmental Language DisorderWhat is Developmental Language Disorder
What is Developmental Language Disorder
 
Developmental language disorder and auditory processing disorder: 
Same or di...
Developmental language disorder and auditory processing disorder: 
Same or di...Developmental language disorder and auditory processing disorder: 
Same or di...
Developmental language disorder and auditory processing disorder: 
Same or di...
 
Fallibility in science: Responsible ways to handle mistakes
Fallibility in science: Responsible ways to handle mistakesFallibility in science: Responsible ways to handle mistakes
Fallibility in science: Responsible ways to handle mistakes
 
Improve your study with pre-registration
Improve your study with pre-registrationImprove your study with pre-registration
Improve your study with pre-registration
 
Introduction to simulating data to improve your research
Introduction to simulating data to improve your researchIntroduction to simulating data to improve your research
Introduction to simulating data to improve your research
 
Southampton: lecture on TEF
Southampton: lecture on TEFSouthampton: lecture on TEF
Southampton: lecture on TEF
 
Reading list: What’s wrong with our universities
Reading list: What’s wrong with our universitiesReading list: What’s wrong with our universities
Reading list: What’s wrong with our universities
 
IJLCD Winter Lecture 2016-7 : References
IJLCD Winter Lecture 2016-7 : ReferencesIJLCD Winter Lecture 2016-7 : References
IJLCD Winter Lecture 2016-7 : References
 

Recently uploaded

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Recently uploaded (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Learning R while exploring illusory correlations

  • 1. Learning R while exploring statistics This exercise is designed to help you learn R while at the same time gaining insights into the phenomenon of illusory correlation. We will go through the following steps: 1. Downloading R and R Studio, an interface to the R programming language that is rather easier to work with than the basic interface. 2. Familiarisation with basic operations in R 3. Generating simulated data: two correlated variables, X and Y 4. Generating simulated data from two groups with different means on uncorrelated variables U and V to demonstrate spurious correlation between U and V. 5. Demonstrating how incorporating group identity in a linear model unmasks the spurious nature of the correlation between U and V. 6. Demonstrating how removing the effect of group will be misleading if group identity is highly dependent on one of the variables. These instructions apply to those working on PC, and I don't know whether equivalent on Mac. For steps 5 and 6 it's assumed you have a basic understanding of simple regression. 1. Downloading R and R Studio Downloading R R is a powerful language for statistical computing, but much of the documentation is written for experts, and so it can be daunting for beginners. If you go to the website: http://www.r-project.org/ You will see instructions for how to download R. Do not be put off by the instruction to "choose your preferred CRAN mirror": this just means you should select a download site from the list provided that is geographically close to where you are. You may then be offered further options that you may not fully understand. Just persevere by selecting the 'windows' option from the "Download and install R" section, and then select 'base', which at last takes you to a page with straightforward download instructions. Installation of R will create a Start Menu item and an icon for R on your desktop. Downloading R Studio To download R studio, go to this website and follow the instructions. 
http://rstudio.org/ If for any reason you prefer not to use R Studio, the examples should all work from the original R interface, but your screen may look different, and it may be difficult to arrange items such as figures in a sensible way. 2. Familiarisation with basic operations in R After opening R Studio your screen will be divided into several windows. Move your cursor to the window called R console, in which you can type commands. You will see a > cursor. This cursor will not be shown in the examples below, but it indicates that the console is awaiting input from you. At the > cursor, type: help.start() As with other programming languages, you hit Enter at the end of each command. This will open a window showing links to various manuals. You may want to briefly explore this before going further. Just to familiarise yourself with the console, type: 1+2 R evaluates the expression and you see output: [1] 3 th Version 1.1 9 June 2012 1
The [1] at the beginning of the output line indicates that the answer is the first row of the variable. This looks confusing if you just have a single number, as in this case, but, as we will see, output can consist of an array of numbers.

Now type:

x = 1+2

Nothing happens. But the variable x has been assigned, and if you now type x on the console, you will again see the output:

[1] 3

In R, the results of variable assignments are not shown automatically, but you can see them at any time by just typing the name of the variable. You can also see all current variables in the Workspace screen on the right. The value assigned to variable x will remain assigned unless you explicitly remove it using the 'rm' command. Type:

rm(x)

You now see that x has disappeared from the Workspace. If you type x again, the console gives the message:

Error: object 'x' not found

You can repeat an earlier command by pressing the up arrow until it reappears. Use this method to redo the assignment x=1+2, and then type X. Again you get the error message, because R is case-sensitive, and so X and x are different variables.

Now type:

y = c(1, 3, 6, 7)

The Workspace tells you y is a numeric variable with four values, i.e. a vector. To see the values, type y on the console. You will see the vector of numbers [1 3 6 7]. The 'c' in the previous command is not a variable name, but rather denotes the operation of concatenation. It just instructs R to create a variable consisting of the sequence of material that follows in brackets.

Now type:

x=

and hit Enter. The cursor changes to +. This is R telling you that the command is incomplete. If you now type 1+2 followed by Enter, your regular cursor returns, because the command is completed. It can happen that you start typing a command and think better of it. To escape from an incomplete command, and restore the > cursor, just hit Escape.
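In the spirit of playing around, here is a short sketch of extra things to try at the console with a vector like the y created above. These lines are my own additions, not part of the tutorial's sequence, but they show how R treats whole vectors at once:

```r
y = c(1, 3, 6, 7)  # the vector created earlier
y[2]               # square brackets pick out elements: this gives the 2nd value, 3
length(y)          # the number of elements: 4
y*2                # arithmetic works element-wise: 2 6 12 14
sum(y)             # built-in functions operate on whole vectors: 17
```

Trying variants such as y[c(1,3)] or y > 4 will show how indexing and comparisons also apply to whole vectors.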
The Console is useful for doing quick computations and checking out commands, but in general, when you do computations, you will want to use a script, i.e. a set of commands that you can save, so you can repeat the sequence of operations at any time. The script is written in the Source window (also known as the Editor window).

From the menu at the top of the screen select File|New|R script. You will see a new tab in the Source window, labelled Untitled1. You want to save it with a name. Select a name such as Demo1 and type this in the Source window, preceded by the symbol #. It is important that the name contains no blank spaces. If you make a script name with blank spaces, this can create havoc later on, because when you try to execute it, R will interpret all but the first word as commands, and you will get misleading error messages that will have you scratching your head as to what they mean.
The hash symbol that you typed before the script name is used to create a comment in a script, i.e. a line that is used to remind the user of important information, but which is not executed when the script runs. It is customary to put the title of the script, plus information about its function, author and date, at the head of the script. Select the menu command File|Save As to save the script with that name.

Currently, your script doesn't do anything. Let's give it some content. In the Source window type:

x=2+3
y=4+5
z=x+y

Now select the top menu item Edit|Run Code|Run All. As the script executes, you will see the commands in the script repeated in the Console window, and the values of the variables x, y and z in the Workspace window. These variables will remain assigned to these values until explicitly cleared. You can test this by typing a command at the console such as:

x-y

which will give the answer -4.

Important: Traditionally, R scripts use <- instead of =. So you will see instances of scripts which have commands such as a <- 1+3. This is equivalent to a = 1+3. It is also possible to have the arrow going the other way, i.e. 1+3 -> a, which means the same thing. My view of life is that you should never make two keystrokes when one will do, and so I persist with the use of the equals sign, but R purists disapprove of this. One reason for avoiding = in assigning values to variables is that it can be confusing, because the equals symbol is also used in other contexts, such as judging whether two things are the same. For the present, I'm not going to worry you further about this, but you may want to squirrel that fact away. Confusion between different uses of the = operator causes much grief, not just in R but in most programming languages.

Loops: A loop is a way of repeatedly executing the same code. Suppose we wanted to print out the ten times table: we could type 1*10, 2*10, 3*10, and so on.
But a simpler method is to use a loop, where we multiply 10 times a variable, myx, and specify the range of values that myx will take at the start of the loop. Thus we can type in the commands:

for (myx in c(1:10)) {
print(10*myx)
}

The first line specifies the values that myx can take, i.e. c(1:10), which is the values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The program executes all the commands between curly brackets repeatedly, incrementing the value of myx each time it does so, until it gets to the final value, whereupon it exits the loop.

Stopping a program: Sometimes a program has been written in a way that it keeps running and never stops. If you need to abort, you just type Ctrl+C.

Commenting: A good script will contain many lines preceded by #. This indicates that the line is a comment – it does not contain commands to be executed, but provides explanation of how the script works.

Before you go any further, create a new directory that will contain all of your scripts, data, and workspace for a project. Then go to the menu and select Tools|Set Working Directory|Choose Directory and navigate to your new directory. This means that all your
work will be saved in one place. Whenever you start up R from a file in that directory, it will continue as your working directory.

A note on quotes: If you paste a script into your R console or browser, quotes may get reformatted, causing an error. Always check: for R, single quotes should be straight quotes, not 'smart quotes' (i.e. quotes that slant or curl in a different direction at the start and end of a quoted section). You may need to retype them if your system has reformatted them.

Further reading

The best way to learn R is to play with it. You should try typing in commands to see what happens. Use the R Manuals from the Help screen to get started. In addition, these texts are recommended:

Braun, W. J., & Murdoch, D. J. (2007). A first course in statistical programming with R. Cambridge: Cambridge University Press.
Crawley, M. J. (2007). The R book. Chichester, UK: Wiley.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer. (Do not be put off by the title: it really should be entitled 'with S and R'.)

3. Generating simulated data: two correlated variables, X and Y

An important aspect of R is the ease with which you can generate simulated data. Playing with simulated data is one of the best ways of gaining an intuitive grasp of statistics. You can create a dataset with certain characteristics, and then see what happens when you analyse it in different ways. Most introductions to statistics ignore the potential of simulated data, and simulation is often seen as an advanced topic. My view is that it should be one of the first things you learn to do.

As a first exercise in running a script in R, we shall generate a simulated set of data for two variables, look at some basic statistics for the variables, plot them, and save the data. We will be using the data for more interesting purposes later on, but for the time being, the aim is to familiarise you with some key R commands.
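As a minimal taster of what simulation looks like (my own sketch, separate from the main walkthrough below), base R's rnorm function draws random numbers from a normal distribution with a specified mean and SD:

```r
set.seed(1)                          # fix the random seed so the result is reproducible
myscores = rnorm(1000, mean = 100, sd = 15)  # 1000 draws from a normal distribution
mean(myscores)                       # close to 100, but not exact, due to sampling error
sd(myscores)                         # close to 15, for the same reason
hist(myscores)                       # a quick histogram of the simulated sample
```

Even this one-liner simulation illustrates a theme that recurs below: a sample's statistics wobble around the population values you specified.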
In addition, it is very useful to know how to simulate datasets with specific characteristics, as these can be used to check how various analyses work. Unfortunately, most people do find R commands quite daunting, and the command needed to create simulated data will probably look horrific if you are a newbie. Also, in R the help commands are often not all that helpful, as they are written for statisticians. Don't lose your nerve. I shall walk you through it and all will become clear.

One of the first things you need to understand about R is that there is a huge number of functions that you can use to carry out various statistical, mathematical and graphic operations, but they aren't all available when you start up R. Many of them are available in 'packages' which you have to specify if you want to use them. There's a nice explanation of how you can find and use packages here: http://ww2.coastal.edu/kingw/statistics/R-tutorials/package.html. We're going to use commands from a package called MASS, which contains functions and datasets from 'Modern Applied Statistics with S' by Venables and Ripley (see Further reading above). All we need to do is to include the following line in our script:

require(MASS)

Once that command is executed, all the functions from MASS will be available for us to use. When learning R, it's a good idea to run each new command and see what, if anything, happens, and whether the workspace changes. If you just highlight one or more commands in the Editor window and then hit the Run button with a green arrow at the top of the window, this runs just those commands. If you run the 'require' command as above, the Console just reassuringly tells you it is loading MASS.
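If require ever complains that a package is not installed, it can be installed first from your chosen CRAN mirror. This is a small sketch of the usual pattern (MASS ships with standard R installations, so the install step will rarely be needed for this particular package):

```r
# Load MASS, installing it first if it is missing.
# require() returns TRUE if the package loaded, FALSE (with a warning) if not.
if (!require(MASS)) {
  install.packages("MASS")  # downloads from your chosen CRAN mirror
  library(MASS)             # then load it
}
```

The same pattern works for any package name you substitute for MASS.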
Now, we're going to generate two columns of correlated numbers, X and Y. We'll start by creating a variable to hold their names. The next line of your script should be:

mylabels=c('X','Y') # Put labels for the two variables in a vector

Remember: you could just omit the bit after the hash, which is a comment. It's there to remind you what you are doing. It may be obvious now, but, trust me, it won't be if you come back a week later. You should add your own comments, using language that will be helpful to you.

If you run this command you will see that the Workspace shows mylabels as a character variable with two values. R knows to treat mylabels as a character, rather than a numeric, variable because you have enclosed the labels in quotes. Does it matter if you use single or double quotes? I couldn't remember, so I just tried making a different variable by typing a command on the console with double quotes – you should do the same. It's always good practice to just play around with commands and see what happens.

We are going to use a fancy command from MASS called mvrnorm. It's not uncommon to forget the precise format that a command needs, but help is at hand. On the console type help(mvrnorm), and you will find that the Help screen shows you the way the command is used. It first tells you what the arguments are for the command, i.e. the things you need to specify to make it work; it then terrifies you with a more technical explanation, and finally gives a worked example. The worked example may be helpful or may just baffle you completely.

Let's look at mvrnorm. The help screen starts as follows:

mvrnorm(n = 1, mu, Sigma, tol = 1e-6, empirical = FALSE)

and then gives an account of what each argument is:

n: the number of samples required.
mu: a vector giving the means of the variables.
Sigma: a positive-definite symmetric matrix specifying the covariance matrix of the variables.
tol: tolerance (relative to largest variance) for numerical lack of positive-definiteness in Sigma.
empirical: logical. If true, mu and Sigma specify the empirical, not population, mean and covariance matrix.

Thus the first things you need to specify are the number of cases to simulate (n), the means of the variables, and the covariance matrix. We are going to be working with z-scores, to make life easier. Remember that for z-scores, a correlation is equivalent to a covariance, and the SD and variance are both equal to 1.

We first specify the correlation that we want:

myr = .5

Add that to your script, and run it, so that we have a value in myr. For Sigma, we need to specify the following matrix:

[1 myr
myr 1]

In R, you can create a matrix using the c (concatenate) command, but if you just typed c(1, myr, myr, 1), this wouldn't work. Why not? Try typing it at the console and see. You'll find you have the right numbers, but they aren't in a 2x2 matrix. To get them properly arranged, you need to explicitly specify that you want a matrix with two rows and two columns. So the full command is:

mysigma = matrix(c(1,myr,myr,1),2,2)
The last two numbers in the command indicate we want 2 rows and 2 columns. Look at mysigma. You could then try making another matrix, but with 1, 4 rather than 2, 2 at the end. I can't stress enough that to understand commands, you just have to try them out. If you aren't sure how something works, tweak a command and see what happens.

Note: there's nothing to stop you typing .5 rather than myr in the command above. It will give the same answer. But we want a flexible script that will allow us to play around and look at different values of the correlation, and if we use the variable myr in the code, rather than a specific value, this allows us to do that easily.

So all you now need is to specify the number of cases and the mean values for X and Y. We do this with the commands:

myn=50 # We're going to create 50 rows of data
mymean=c(0,0) # Means are zero for both X and Y

We are now ready to go! What about the other arguments, tol and empirical? They are optional and we'll leave them alone for the moment, though we will look at empirical later on. We need a variable name for our simulated data. Let's call it myarray. So we type:

myarray=mvrnorm(n=myn, mymean, mysigma)

Now run the whole script. Each command is reflected in the console as it executes. But where are the results? The Workspace now confirms that you have created myarray, which is a matrix with 50 rows and 2 columns. To look at the results, just type myarray on the console. There are your 50 paired z-scores!

Before going further, I'll just explain why I've created variables that all start with 'my'. This is not essential, but it's a fairly common method. It has the advantage that you are unlikely to inadvertently use a variable name that corresponds to an existing R command, and when reading a script it makes it generally easier to distinguish your variables from other parts of the R language. We have created paired variables, but they aren't yet labelled. Assigning column names to a matrix in R is easy.
Remember, we created mylabels earlier. We can assign these as our column names as follows:

colnames(myarray)=mylabels

So now you have built up a whole script to generate paired numbers, which looks like this:

# simulate_XY
# Script to simulate z-scores X and Y, with specific correlation
require(MASS) # Load functions from Modern Applied Statistics with S
mylabels=c('X','Y') # Labels to be used later for our variables
myr=.5 # Correlation (can be changed)
mysigma = matrix(c(1,myr,myr,1),2,2) # 2 x 2 covariance matrix
# (with z-scores, equivalent to correlation matrix)
myn=50 # N rows of data to simulate
mymean=c(0,0) # Means for each variable (zero for z-scores)
myarray=mvrnorm(n=myn, mymean, mysigma) # Create array of simulated data
colnames(myarray)=mylabels # Assign labels to columns of simulated data

But you may be suspicious. How do you know that the numbers you have generated have a mean of zero and are actually correlated .5? You can use R commands to find out. This command gives you a range of descriptive statistics, including the means:

summary(myarray)
and this one gives the correlation matrix:

cor(myarray)

At this point, you may start to think (depending on your locus of control) either that you have done something wrong, or that R is not very good. It's highly likely that your means will differ from zero, and the correlation will be smaller or bigger than .5. The reason is that we did not specify empirical = TRUE. R has faithfully generated a sample of observations from a population of values where the true correlation is .5, but because of sampling error, the observed value in this sample is likely to deviate from .5. If you re-run the program, but this time alter the mvrnorm command to:

myarray=mvrnorm(n=myn, mymean, mysigma, empirical=TRUE)

then you will find the means are zero (or, more likely, a real number that is infinitesimally small) and the correlation is .5. Alternatively, you could remove the empirical argument (or specify empirical=FALSE, which has the same effect), but specify n = 50000, or another very large number. The larger the sample you take from the population, the closer the sample correlation will be to the population correlation.

It's always a good idea to plot data as well as looking at summary statistics. To see a scatterplot of your data, add this command to your script:

plot(myarray)

A graph will now pop up in the Plots tab of the right-hand lower window.

Finally, you might want to save your simulated data so you can use them at a later time. This command will write a data file to your current directory:

write.table(myarray,"mysimdata")

If you want to get your data back on another occasion, this command will read the saved data into a matrix called newdata:

newdata=read.table("mysimdata")

The mvrnorm command uses a random number generator, which means that each time you run the script, different numbers will be generated. If you want to always get the same numbers, you can do so by just specifying a 'seed' for the random number generator.
This can be any number, but provided it is the same number each time, you'll get the same result. Just put this command somewhere before the mvrnorm command:

set.seed(2)

If you have started from scratch and got this far, then you should take a break and reward yourself with a cup of coffee or whatever other substances hit the spot for you.

4. Generating simulated data from two groups with different means on uncorrelated variables U and V

We're now going to apply what we've learned to generate data from two separate groups on two variables that are uncorrelated. The only difference is that the means differ on both variables for the two groups. Let's set the means for U and V for group A as -1; for group B they'll be 1. We'll generate 60 cases for each group. We'll call these datasets myarrayA and myarrayB. If you've followed what we've done so far, you should be able to work out how to do this. It will be a good exercise to try, as you learn R by thinking it through, rather than by just copying. But I'll give you a script to do it anyway, in case you get stuck:

#demo_spurious_corr_script
require(MASS) # Load functions from Modern Applied Statistics with S
mylabels=c('U','V')
myr=0 # U and V are uncorrelated, and so r is set to zero
mysigma = matrix(c(1,myr,myr,1),2,2)
myn=60
set.seed(3)
# Array for group A
mymean=c(-1,-1) # Mean z-score for group A
myarrayA=mvrnorm(n=myn, mymean, mysigma) # Generate uncorrelated U and V for group A
colnames(myarrayA)=mylabels
summary(myarrayA)
cor(myarrayA)
plot(myarrayA)
# Array for group B
mymean=c(1,1) # Mean z-score for group B
myarrayB=mvrnorm(n=myn, mymean, mysigma) # Generate uncorrelated U and V for group B
colnames(myarrayB)=mylabels
summary(myarrayB)
cor(myarrayB)
plot(myarrayB)

We now want to stack the two arrays, one above the other, into a combined array with a new name, myarrayAB. This can be achieved with a single command for concatenating rows, as follows:

myarrayAB=rbind(myarrayA,myarrayB)

We can then look at the correlation for the combined groups:

cor(myarrayAB)

Even though the correlation within either group was set to zero, the correlation for the combined groups is around .5 and highly significant. This is the phenomenon of spurious correlation. To make it more concrete, consider if U and V were height and chest hairiness, and groups A and B were males and females. Since men tend to be taller and hairier than women, you could find a spurious correlation between height and hairiness in a combined group, even though they are uncorrelated within either sex.

One reason I like simulations is that they can give you new insights into such phenomena. Note that we specified massive mean differences between our groups: one group with a mean z-score of +1 and the other with a mean z-score of -1. When I first attempted this simulation, I used much smaller group differences, and was surprised at how hard it was to generate a spurious correlation. With a simulation like this, you can play around and get a good feel for the phenomenon by repeatedly generating datasets with different values.
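This playing around can itself be scripted. The sketch below is my own extension of the script above, not part of the original: it loops over several group mean separations and prints the combined-sample correlation each time. Using empirical=TRUE inside each group makes the within-group correlation exactly zero, so the pattern is easy to see:

```r
require(MASS)   # for mvrnorm
set.seed(4)
mysigma = matrix(c(1, 0, 0, 1), 2, 2)   # U and V uncorrelated within groups
myn = 60
for (mydiff in c(0, .5, 1, 2)) {        # half the distance between the group means
  grpA = mvrnorm(myn, c(-mydiff, -mydiff), mysigma, empirical = TRUE)
  grpB = mvrnorm(myn, c( mydiff,  mydiff), mysigma, empirical = TRUE)
  combined = rbind(grpA, grpB)          # stack the two groups
  cat("mean difference", 2*mydiff, ": combined r =",
      round(cor(combined)[1, 2], 2), "\n")
}
```

With these settings the printed correlations rise from 0 through roughly .2 and .5 to about .8 as the mean separation grows, which matches the observation above that modest group differences produce surprisingly little spurious correlation.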
The phenomenon of spurious correlation is a source of major concern, especially for those interested in correlational data, but my impression is that its importance may have been overemphasised, because in practice it doesn't become a problem except in quite extreme situations where you have two groups with very different mean values.

5. Demonstrating how incorporating group identity in a linear model unmasks the spurious nature of the correlation between U and V

Let us stick with the interpretation of our simulated data as representing height and hairiness in males and females (ignoring the fact that the group mean differences are vastly greater than would be realistic). We now need to add to our combined dataset another column that specifies gender. The R command rep will just create a vector of repeated numbers. We make a set of 60 values equal to .5 for males, and 60 values equal to -.5 for females. The reason for picking these specific values is that it helps interpretation of regression output if we set the average for the two groups to zero and make the mean difference between them equal to one. However, it's not
essential to do this, and you could have picked other numbers, such as 0 and 1, to indicate group identity.

males=rep(.5,myn) # Create vector with myn repetitions of value .5
females=rep(-.5,myn) # Create vector with myn repetitions of value -.5

Having made our two sets of numbers, we then join them together in a variable called gender as follows:

gender=c(males, females)

Run these commands and then type gender at the console to check the result. All that is now needed is to bolt this column on to our existing myarrayAB, which we can do with a single command for concatenating columns, cbind:

myarrayAB=cbind(gender,myarrayAB)

Note that I have created a lot of intermediate variables in the course of generating myarrayAB. This is unnecessary and uses up memory. It would be possible to combine several steps in one command and so avoid creating the intermediate variables. However, when learning R, I think it is helpful to break commands down into small steps and create new variables, as this allows you to see the logic of what is being done, and to check the values of each variable. It also makes your scripts easier to understand when you come back to them later. Very experienced programmers may write much more compact code than this, but with modern computers, memory is seldom a problem unless you are working with very large data arrays, and so, apart from demonstrating how clever you are, compact code doesn't serve much function.

We now want to do a regression analysis. We will start with simple regression of V on U for the combined group data. R has many powerful commands for doing regression, but it requires that the data are formatted in what is called a data frame. Fortunately, this transformation is trivially easy: we just add the command:

mydata=data.frame(myarrayAB)

Commands for regression in R are formulated in terms of the general linear model.
This is a very general and flexible approach to statistical analysis that readily incorporates the more traditional methods beloved of psychologists, such as analysis of variance. However, I suspect that many psychologists reading this won't find it a very intuitive way to think about data, and it takes a while to map the R commands onto pre-existing statistical knowledge.

The other thing that can be puzzling is that with programs such as SPSS, we are used to running a command and then looking at the output screen. Although R can be used in an analogous way, it is more usual to write the results to another variable. The variable that holds the results is likely to be a fairly complex structure, as we shall see. But the basic idea is that you don't just use a command to do the analysis: you actually specify a name for the output of the analysis.

The simplest form of regression is pretty easy. The command lm just stands for linear model, and requires two obligatory arguments: you have to specify a formula that indicates the relationship between predicted and predictor variables, and specify the dataset used to estimate the regression coefficients. So let's illustrate this with our U and V variables. Add this command to the script:

myreg1=lm(V~U,mydata)

and then inspect the myreg1 variable that is created. This contains two coefficients: an intercept, which is close to zero, and a slope, which is close to 0.5. Note that when you type myreg1 you also get information about the formula used to generate the coefficients, labelled 'call'. The output of lm contains a complex set of varied information in a
structure. If you want to look at just part of the structure, you have to use the $ sign to indicate which bit. Try this, by just typing at the console:

myreg1$call

and

myreg1$coef

You will see that the portion after the $ indicates which bit of the myreg1 structure is referred to. The term V~U tells the program to fit a straight line according to the formula:

V = b1 + b2.U

where b1 is the intercept and b2 is the slope. It is these intercepts and slopes that are generated when the lm command is executed. We can use these outputs to plot the regression line. First plot the raw data, with U on the horizontal axis and V on the vertical axis. This command will achieve that:

plot(V~U, mydata)

The command abline plots a straight line with a given intercept and slope. You could add a straight line through an intercept of zero and with a slope of 1, as follows:

abline(0,1)

The regression line is simply the straight line with intercept and slope corresponding to the computed regression coefficients, and so can be plotted just by typing:

abline(myreg1$coef)

The lty argument allows you to specify the type of line you want. This command will redraw the regression line as a dashed line:

abline(myreg1$coef,lty=5)

As an aside here, I haven't used R very much, and when I first saw a command with lty I was confused and thought it was some kind of variable. This is, in my experience, a common difficulty with R. Various letter sequences that look like variables or functions aren't. What did I do? I Googled "R lty" and immediately all became clear. Perhaps the single most important piece of advice if you want to learn R is to just use Google if you get stuck.

We now want to look at the regression with gender included. A simple modification to the syntax achieves this. We have taken care to code gender so that the sum of the two gender codes is zero, and we can include it in the linear model, even though it is a categorical variable.
Here is the command:

myreg2=lm(V~U+gender,mydata)

This corresponds to the regression equation:

V = b1 + b2.U + b3.gender

If we type myreg2, we see that the output now has one intercept and two regression coefficients, like this:

(Intercept)        U   gender
    0.03357  0.04529  1.68913

Your values may differ from this because the simulated data will be different, but the overall pattern will be similar. Note that the regression coefficient associated with U is now close to zero, whereas that associated with gender is much bigger. Once we have run the model we can get much more detailed statistical output by requesting a summary, as follows:

summary(myreg2)
Now we have not only the coefficients, but their standard errors, associated t-values and significance levels. This confirms that gender is a substantial predictor of V, and U is not. Finally, you can use the anova command to produce an ANOVA table comparing the fit of the two models:

anova(myreg1,myreg2)

I've learned a lot about using R for regression analysis from this site. It also has information on how to do diagnostic plots, for instance. However, for the present, I won't get diverted into that, but will rather press on to look at what happens if you have groups defined on a variable that is highly correlated with one of the dependent variables.

6. Demonstrating how removing the effect of group will be misleading if group identity is highly dependent on one of the variables

You should by now be able to follow this script, which is heavily commented to explain each step. This time we are going to generate a multivariate normal distribution with three variables. Two of them, L1 and L2, are language measures, and A is an auditory measure. The language measures show moderate correlation with the auditory measure and are highly intercorrelated with one another. Group identity (control or language impaired, coded 1 or -1) is defined in terms of whether or not the score on L1 is above a z-score of -1. This, then, is analogous to the case of dyslexia or language impairment, where we define whether or not the child has the diagnosis on the basis of a low test score. In a case like this, removing the effect of group can abolish the relationship between L2 and A, simply because L1 and L2 are highly intercorrelated. It would be quite wrong to conclude from this that L2 and A are not related.
#demo_spurious_corr_script3
# Using a group variable that is highly correlated with one variable
# With these settings, gives the result that by including SLI category
# you remove the influence of L2
require(MASS) # Load functions from Modern Applied Statistics with S
mylabels=c('L1','L2','A') # 3 variables, two language and one auditory
myr=.8 # Correlation between the language measures
myr2=.3 # Correlation of both language measures with auditory
mysigma = matrix(c(1,myr,myr2, myr,1,myr2, myr2,myr2,1),3,3)
myn=60
set.seed(6) # Change or comment out this line to get a different set of estimates
mymean=c(0,0,0) # Means for L1, L2, and A are zero
myarray3=mvrnorm(n=myn, mymean, mysigma, empirical=TRUE)
colnames(myarray3)=mylabels
summary(myarray3)
cor(myarray3)
myL1=myarray3[,1] # First column
# Now determine which cases are control or SLI and put in mygroup variable
mygroup=rep.int(-1,myn) # Default is SLI, coded -1
mycon=which(myL1 > -1) # Row index of those with L1 in control range
mygroup[mycon]=1 # These rows are assigned group code of 1 (control)
myarrayAB=cbind(mygroup,myarray3) # Add mygroup to the data array
mydata=data.frame(myarrayAB)
# Regression with only group included
myreg1=lm(A~mygroup,mydata)
summary(myreg1)
# Regression with both group and L2 included
myreg2=lm(A~L2+mygroup,mydata)
summary(myreg2)
anova(myreg1,myreg2)
# Regression if we exclude group ID
myreg3=lm(A~L2,mydata)
summary(myreg3)

The point I want to make with this simulation is that if we want to 'take out' the effect of group identity from a correlation, then we need to think carefully about the logic of what we are doing. In the previous example of spurious correlation, we defined gender quite independently of our two measures, height and hairiness. Although males and females differed substantially on both measures, their gender was not determined by those measures. In any logical causal route, we can confidently treat gender as a primary cause, and so it makes sense to 'take out' its effect.

For certain developmental disorders (and indeed other conditions), the causal route is much less certain, because the disorder is diagnosed on the basis of measured variables. So, for instance, dyslexia is defined in terms of low scores on reading measures. In the simulation above, we looked at the correlation between L2 and A, and defined our disorder in terms of L1 – which was highly correlated with L2. We could have defined dyslexia in terms of L2 – you might like to try that: it will achieve a similar effect.

The results we got from our simulation are actually sensible, but there is a danger they will be misinterpreted. What they are actually telling us is that language measures and auditory measures are significantly correlated, and this is evident regardless of whether we use a categorical language measure, where group identity is determined by a cutoff on a test, or a quantitative measure. What this analysis is definitely not saying is that the correlation between language and auditory measures is spurious. It's possible to imagine a situation where you could have a spurious association with these kinds of variables.
For instance, poor social environment may affect both language measures and auditory measures. To show that, we'd need to incorporate a measure of social environment in our regression analysis. But the bottom line is that if we want to argue that an association between variables X and Y is spurious, we must have a third variable, Z, that is (a) measurable and (b) not dependent on X or Y. Z may be highly correlated with X and Y: that's not a problem. The problem is when Z is determined by X or Y.
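This point about a genuine third variable Z can itself be simulated. In the sketch below (my own illustration, with made-up coefficients), Z stands for something like social environment and independently drives both X and Y; including Z in the model then correctly shrinks the X coefficient towards zero:

```r
set.seed(5)
myn = 200
Z = rnorm(myn)            # a genuine third variable, e.g. social environment
X = .7*Z + rnorm(myn)     # X depends on Z plus independent noise
Y = .7*Z + rnorm(myn)     # Y depends on Z plus independent noise
mydata = data.frame(X, Y, Z)
cor(X, Y)                      # X and Y look correlated...
summary(lm(Y ~ X, mydata))     # ...and X 'predicts' Y on its own
summary(lm(Y ~ X + Z, mydata)) # but with Z included, the X coefficient collapses
```

Here it is legitimate to 'take out' Z, because Z was generated independently of X and Y; contrast this with the L1/L2 example above, where group identity was itself defined by one of the measures.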