Learning R while exploring statistics
This exercise is designed to help you learn R while at the same time gaining insights into the
phenomenon of illusory correlation. We will go through the following steps:
1. Downloading R and R Studio, an interface to the R programming language that is rather
easier to work with than the basic interface.
2. Familiarisation with basic operations in R
3. Generating simulated data: two correlated variables, X and Y
4. Generating simulated data from two groups with different means on uncorrelated variables
U and V to demonstrate spurious correlation between U and V.
5. Demonstrating how incorporating group identity in a linear model unmasks the spurious
nature of the correlation between U and V.
6. Demonstrating how removing the effect of group will be misleading if group identity is highly
dependent on one of the variables.

These instructions apply to those working on a PC; I don't know whether the equivalent
steps apply on a Mac.
For steps 5 and 6 it's assumed you have a basic understanding of simple regression.

1. Downloading R and R Studio
Downloading R
R is a powerful language for statistical computing, but much of the documentation is written
for experts, and so it can be daunting for beginners. If you go to the website:
http://www.r-project.org/
You will see instructions for how to download R. Do not be put off by the instruction to
"choose your preferred CRAN mirror": this just means you should select a download site from
the list provided that is geographically close to where you are.
You may then be offered further options that you may not fully understand. Just persevere by
selecting the 'windows' option from the "Download and install R" section, and then select
'base', which at last takes you to a page with straightforward download instructions.
Installation of R will create a Start Menu item and an icon for R on your desktop.

Downloading R Studio
To download R Studio, go to this website and follow the instructions.
http://rstudio.org/

If for any reason you prefer not to use R Studio, the examples should all work from the
original R interface, but your screen may look different, and it may be difficult to arrange items
such as figures in a sensible way.

2. Familiarisation with basic operations in R
After opening R Studio your screen will be divided into several windows. Move your cursor to
the window called R console, in which you can type commands.
You will see a > cursor.
This cursor will not be shown in the examples below, but it indicates that the console is
awaiting input from you.
At the > cursor, type:

  help.start()
As with other programming languages, you hit Enter at the end of each command.
This will open a window showing links to various manuals. You may want to briefly explore
this before going further.

Just to familiarise yourself with the console, type:
  1+2

R evaluates the expression and you see output:
 [1] 3




Version 1.1, 9th June 2012
The [1] at the beginning of the output line indicates that the answer is the first row of the
variable. This looks confusing if you just have a single number, as in this case, but, as we will
see, output can consist of an array of numbers.

Now type:
  x = 1+2
Nothing happens. But the variable x has been assigned, and if you now type x on the
console, you will again see the output

 [1] 3

In R, the results of variable assignments are not shown automatically, but you can see them
at any time by just typing the name of the variable.
You can also see all current variables in the Workspace screen on the right.

The value assigned to variable x will remain assigned unless you explicitly remove it using the
'rm' command. Type:
   rm(x)
You now see that x has disappeared from the Workspace screen.
If you type x again, the console gives the message:
  Error: object 'x' not found

You can repeat an earlier command by pressing the up arrow until it reappears. Use this
method to redo the assignment x=1+2, and then type X. Again you get the error message,
because R is case-sensitive, and so X and x are different variables.

Now type:

 y = c(1, 3, 6, 7)

The workspace tells you y is a numeric variable with four values, i.e. a vector.
To see the values, type y on the console. You will see the numbers 1 3 6 7, preceded by the
[1] index. The 'c' in the previous command is not a variable name, but rather denotes the
operation of concatenation. It just instructs R to create a variable consisting of the sequence
of material that follows in brackets.

Now type:
  x=
and hit Enter.
The cursor changes to +
This is R telling you that the command is incomplete. If you now type 1+2 followed by Enter,
your regular cursor returns, because the command is completed.

It can happen that you start typing a command and think better of it. To escape from an
incomplete command, and restore the > cursor, just hit Escape.

The Console is useful for doing quick computations and checking out commands, but in
general, when you do computations, you will want to use a script, i.e. a set of commands that
you can save, so you can repeat the sequence of operations at any time. The script is written
in the Source window (also known as the Editor window).

From the menu at the top of the screen select File|New|R script.
You will see a new tab in the Source window, labelled Untitled1. You want to save it with a
name. Select a name such as Demo1 and type this in the Source window, preceded by the
symbol #.
It is important that the name contains no blank spaces.
If you make a script name with blank spaces, this can create havoc later on, because when
you try to execute it, R will interpret all but the first word as commands, and you will get
misleading error messages that will have you scratching your head as to what they mean.




The hash symbol that you typed before the script name is used to create a comment in a
script, i.e. a line that is used to remind the user of important information, but which is not
executed when the script runs. It is customary to put the title of the script, plus information
about its function, author and date at the head of the script.
Select the menu command File|SaveAs to save the script with that name.

Currently, your script doesn't do anything. Let's give it some content.
In the Source window type:
x=2+3
y=4+5
z=x+y

Now select the top menu item Edit|Run Code|Run All.
As the script executes, you will see the commands in the script repeated in the Console
window, and the values of the variables x, y and z in the Workspace window.
These variables will remain assigned to these values until explicitly cleared.
You can test this by typing a command at the console such as:

x-y

which will give the answer -4.

Important: Traditionally, R scripts use <- instead of =.
So, you will see instances of scripts which have commands such as
          a <- 1+3
This is equivalent to
          a = 1+3
It is also possible to have the arrow going the other way, i.e. 1+3 -> a, which means the
same thing.
My view of life is that you should never make two keystrokes when one will do, and so I
persist with the use of the equals sign, but R purists disapprove of this.
One reason for avoiding = in assigning values to variables, is that it can be confusing,
because the equals symbol is also used in other contexts, such as judging whether two things
are the same. For the present, I'm not going to worry you further about this, but you may want
to squirrel that fact away. Confusion between different uses of the = operator causes much
grief, not just in R but in most programming languages.

Loops: A loop is a way of repeatedly executing the same code. Suppose we wanted to print
out the ten times table, we could type 1*10, 2*10, 3*10, and so on. But a simpler method is to
use a loop, where we multiply 10 times a variable, myx, and specify the range of values that
myx will take at the start of the loop. Thus we can type in the commands:
          for (myx in c(1:10))
                   {
                   print(10*myx)
                   }
The first line specifies the values that myx can take, i.e. c(1:10), which is the values 1, 2, 3, 4,
5, 6, 7, 8, 9, 10. The program executes all the commands between curly brackets repeatedly,
incrementing the value of myx each time it does so, until it gets to the final value, whereupon
it exits the loop.
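As an aside (not part of the loop example above): R's arithmetic is vectorised, so you can often do this kind of repetitive computation without an explicit loop at all, by operating on a whole vector in one step:

```r
# Multiply the whole vector 1:10 by 10 in a single operation
print(10 * (1:10))
```

Loops remain useful for more complex tasks, but vectorised code is the more idiomatic (and usually faster) option in R.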

Stopping a program: Sometimes a program has been written in a way that it keeps running
and never stops. If you need to abort, hit the Escape key (or Ctrl+C if you are running R in a
terminal).

Commenting: A good script will contain many lines preceded by #
This indicates that the line is a comment – it does not contain commands to be executed, but
provides explanation of how the script works.

Before you go any further, create a new directory that will contain all of your scripts, data, and
workspace for a project. Then go to the menu and select Tools|Set Working
Directory|Choose Directory and navigate to your new directory. This means that all your


work will be saved in one place. Whenever you start up R from a file in that directory, it will
continue as your working directory.

A note on quotes: If you paste a script into your R console or browser, quotes may get
reformatted, causing an error. Always check: for R, single quotes should be straight quotes,
not 'smart quotes' (i.e. quotes that slant or curl in a different direction at the start and end of a
quoted section). You may need to retype them if your system has reformatted them.

Further reading
The best way to learn R is to play with it. You should try typing in commands to see what
happens. Use the R Manuals from the Help screen to get started.

In addition, these texts are recommended.
Braun, W. J., & Murdoch, D. J. (2007). A first course in statistical programming with R.
Cambridge: Cambridge University Press.
Crawley, M. J. (2007). The R Book. Chichester, UK: Wiley.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S, 4th edition. New
York: Springer. (Do not be put off by the title: it really should be entitled 'with S and R')

3. Generating simulated data: two correlated variables, X and Y
An important aspect of R is the ease with which you can generate simulated data.
Playing with simulated data is one of the best ways of gaining an intuitive grasp of statistics.
You can create a dataset with certain characteristics, and then see what happens when you
analyse it in different ways.

Most introductions to statistics ignore the potential of simulated data, and simulation is often
seen as an advanced topic. My view is that it should be one of the first things you learn to do.

As a first exercise in running a script in R, we shall generate a simulated set of data for two
variables, look at some basic statistics for the variables, plot them, and save the data. We will
be using the data for more interesting purposes later on, but for the time being, the aim is to
familiarise you with some key R commands. In addition, it is very useful to know how to
simulate datasets with specific characteristics, as these can be used to check how various
analyses work.

Unfortunately, many people find R commands quite daunting, and the command needed to
create simulated data will probably look horrific if you are a newbie. Also, in R the help
commands are often not all that helpful, as they are written for statisticians. Don't lose your
nerve. I shall walk you through it and all will become clear.

One of the first things you need to understand about R is that there is a huge number of
functions that you can use to carry out various statistical, mathematical and graphic
operations, but they aren't all available when you start up R. Many of them are available in
'packages' which you have to specify if you want to use them. There's a nice explanation of
how you can find and use packages here:
http://ww2.coastal.edu/kingw/statistics/R-tutorials/package.html

We're going to use commands from a package called MASS, which contains functions and
datasets from 'Modern Applied Statistics with S' by Venables and Ripley (see Recommended
Reading above).

All we need to do is to include the following line in our script:
require(MASS)

Once that command is executed, all the functions from MASS will be available for us to use.
When learning R, it's a good idea to run each new command and see what, if anything
happens, and whether the workspace changes. If you just highlight one or more commands in
the Editor window and then hit the Run button with a green arrow at the top of the window,
this just runs that command. If you run the 'require' command as above, the Console just
reassuringly tells you it is loading MASS.


Now, we're going to generate two columns of correlated numbers, X and Y.
We'll start by creating a variable to hold their names. The next line of your script should be:
mylabels=c('X','Y')            # Put labels for the two variables in a vector

Remember: You could just omit the bit after the hash, which is a comment. It's there to
remind you what you are doing. It may be obvious now, but, trust me, it won't be if you come
back a week later. You should add your own comments, using language that will be helpful to
you.

If you run this command you will see that the Workspace shows mylabels as a character
variable with two values. It knows to treat mylabels as a character, rather than number
variable because you have enclosed the labels in quotes. Does it matter if you use single or
double quotes? I couldn't remember, so I just tried making a different variable by typing a
command on the console with double quotes - you should do the same. It's always good
practice to just play around with commands and see what happens.

We are going to use a fancy command from MASS called mvrnorm. It's not uncommon to
forget the precise format that a command needs, but help is at hand.
On the console type help(mvrnorm), and you will find that the Help screen shows you the way
the command is used. It first tells you what the arguments are for the command, i.e. the things
you need to specify to make it work, it then terrifies you with a more technical explanation,
and finally gives a worked example. The worked example may be helpful or may just baffle
you completely.

Let's look at mvrnorm. The help screen starts as follows:
mvrnorm(n = 1, mu, Sigma, tol = 1e-6, empirical = FALSE)
and then gives an account of what each argument is.

n         the number of samples required.
mu        a vector giving the means of the variables.
Sigma     a positive-definite symmetric matrix specifying the covariance matrix of the
          variables.
tol       tolerance (relative to largest variance) for numerical lack of positive-definiteness
          in Sigma.
empirical logical. If true, mu and Sigma specify the empirical not population mean and
          covariance matrix.
Thus the first things you need to specify are the number of cases to simulate (n), the means
of the variables (mu), and the covariance matrix (Sigma).

We are going to be working with z-scores, to make life easier. Remember that for z-scores, a
correlation is equivalent to a covariance, and the SD and variance are both equal to 1.

We first specify the correlation that we want:
myr = .5
Add that to your script, and run it, so that we have a value in myr.
For Sigma, we need to specify the following matrix:
[ 1    myr
  myr  1  ]

In R, you can create a matrix using the c (concatenate) command, but if you just typed c(1,
myr, myr, 1), then this wouldn't work. Why not? Try typing this at the console and see.
You'll find you have the right numbers, but they aren't in a 2x2 matrix. To get them properly
arranged, you need to explicitly specify that you want a matrix with two rows and two
columns.

So the full command is
mysigma = matrix(c(1,myr,myr,1),2,2)




The last two numbers in the command indicate we want 2 rows and 2 columns. Look at
mysigma. You could then try making another matrix, but with 1, 4 rather than 2, 2 at the end. I
can't stress enough that to understand commands, you just have to try them out. If you aren't
sure how something works, tweak a command and see what happens.
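If you would like a worked version of that experiment (the matrix function is base R; the values are just the ones used above), try these at the console:

```r
myr <- .5
matrix(c(1, myr, myr, 1), 2, 2)  # a 2x2 matrix: the covariance matrix we want
matrix(c(1, myr, myr, 1), 1, 4)  # the same four numbers, but as a single row
```

Note that matrix() fills its values column by column by default; here it makes no difference, because our covariance matrix is symmetric.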

Note - there's nothing to stop you typing .5 rather than myr in the command above. It will give
the same answer. But we want a flexible script that will allow us to play around and look at
different values of the correlation, and if we use the variable myr in the code, rather than a
specific value, this allows us to do that easily.

So all you now need is to specify the number of cases and the mean values for X and Y.
We do this with the commands:

myn=50       # we're going to create 50 rows of data
mymean=c(0,0) #means are zero for both X and Y

We are now ready to go! What about the other arguments, tol and empirical? They are
optional and we'll leave them alone for the moment, though we will look at empirical later on.

We need a variable name for our simulated data. Let's call it myarray. So we type:
myarray=mvrnorm(n=myn, mymean,mysigma)

Now run the whole script. Each command is reflected in the console as it executes. But where
are the results? The workspace now confirms that you have created myarray which is a
matrix with 50 rows and 2 columns. To look at the results, just type myarray on the console.
There are your 50 paired z-scores!

Before going further, I'll just explain why I've created variables that all start with 'my'. This is
not essential, but it's a fairly common method. It has the advantage that you are unlikely to
inadvertently use a variable name that corresponds to an existing R command, and when
reading a script it makes it generally easier to distinguish your variables from other parts of R
language.

We have created paired variables, but they aren't yet labelled. Assigning column names to a
matrix in R is easy. Remember, we created mylabels earlier. We can assign these as our
column names as follows:
colnames(myarray)=mylabels

So now you have built up a whole script to generate paired numbers, which looks like this:

# simulate_XY
# Script to simulate z-scores X and Y, with specific correlation

require(MASS) # Load functions from Modern Applied Statistics for S
mylabels=c('X','Y') # Labels to be used later for our variables
myr=.5              # Correlation (can be changed)
mysigma = matrix(c(1,myr, myr,1),2,2) # 2 x 2 covariance matrix
                                        # (with zscores, equiv to correlation matrix)
myn=50             # N rows of data to simulate
mymean=c(0,0) # Means for each variable (zero for zscores)
myarray=mvrnorm(n=myn, mymean,mysigma) # create array of simulated data
colnames(myarray)=mylabels #Assign labels to columns of simulated data

But you may be suspicious. How do you know that the numbers you have generated have
mean of zero and are actually correlated .5?
You can use R commands to find out.

This command gives you a range of descriptive statistics, including the means:
summary(myarray)




and this one gives the correlation matrix:
cor(myarray)

At this point, you may start to think (depending on your locus of control) either that you have
done something wrong, or that R is not very good. It's highly likely that your means will differ
from zero, and the correlation will be smaller or bigger than .5. The reason is that we did not
specify empirical = TRUE. R has faithfully generated a sample of observations from a
population of values where the true correlation is .5, but because of sampling error, the
observed value in this sample is likely to deviate from .5.

If you re-run the program, but this time alter the mvrnorm command to:
myarray=mvrnorm(n=myn, mymean,mysigma,empirical=TRUE)
then you will find the means are zero (or, more likely, a number infinitesimally close to zero)
and the correlation is .5.

Alternatively, you could remove the empirical command (or specify empirical=FALSE, which
has the same effect), but specify n = 50000, or another very large number. The larger the
sample you take from the population, the closer the sample correlation will approach to the
population correlation.

It's always a good idea to plot data as well as looking at summary statistics. To see a
scatterplot of your data, add this command to your script:
plot(myarray)
A graph will now pop up in the Plots tab of the right hand lower window.

Finally, you might want to save your simulated data so you can use them at a later time.
This command will write a data file to your current directory:
write.table(myarray,"mysimdata")
If you want to get your data back on another occasion, this command will read the saved data
into a matrix called newdata
newdata=read.table("mysimdata")
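One wrinkle worth knowing: read.table returns a data frame rather than a matrix, so if you want to carry on doing matrix operations with the reloaded data, convert it back with as.matrix:

```r
newdata <- read.table("mysimdata")
class(newdata)               # a data frame, not a matrix
newdata <- as.matrix(newdata)  # convert back for matrix operations
```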

The mvrnorm command uses a random number generator, which means that each time you
run the script, different numbers will be generated. If you want to always get the same
numbers, you can do so by just specifying a 'seed' for the random number generator. This
can be any number, but provided it is the same number each time, you'll get the same result.
Just put this command somewhere before the mvrnorm command:
set.seed(2)
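You can see the effect of set.seed for yourself at the console: setting the same seed before two runs makes the random numbers come out identical. (rnorm, which generates random normal deviates, is used here just as a quick illustration.)

```r
set.seed(2)
a <- rnorm(5)    # five random normal deviates
set.seed(2)
b <- rnorm(5)    # same seed, so the same five numbers
identical(a, b)  # TRUE
```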

If you have started from scratch and got this far, then you should take a break and reward
yourself with a cup of coffee or whatever other substances hit the spot for you.

4. Generating simulated data from two groups with different means on uncorrelated
variables U and V
We're now going to apply what we've learned to generate data from two separate groups on
two variables that are uncorrelated. The only difference is that the means differ on both
variables for the two groups. Let's set the means for U and V for group A as -1, and for group
B they'll be 1. We'll generate 60 cases for each group. We'll call these datasets myarrayA and
myarrayB.

If you've followed what we've done so far, you should be able to work out how to do this. It will
be a good exercise to try, as you learn R by thinking it through, rather than by just copying.
But I'll give you a script to do it anyway, in case you get stuck:

#demo_spurious_corr_script

require(MASS) #Load functions from Modern Applied Statistics for S
mylabels=c('U','V')
myr=0 #U and V are uncorrelated, and so r is set to zero
mysigma = matrix(c(1,myr, myr,1),2,2)


myn=60
set.seed(3)
#Array for group A
mymean=c(-1,-1) #mean zscore for group A
myarrayA=mvrnorm(n=myn, mymean,mysigma) #Generate uncorrelated U and V for grp A
colnames(myarrayA)=mylabels
summary(myarrayA)
cor(myarrayA)
plot(myarrayA)

#Array for group B
mymean=c(1,1) #mean zscore for group B
myarrayB=mvrnorm(n=myn, mymean,mysigma)#Generate uncorrelated U and V for grp B
colnames(myarrayB)=mylabels
summary(myarrayB)
cor(myarrayB)
plot(myarrayB)

We now want to combine the two arrays into one long column, and call this combined array
with a new name, myarrayAB.

This can be achieved with a single command for concatenating rows, as follows:
myarrayAB=rbind(myarrayA,myarrayB)

We can then look at the correlation for the combined groups:
cor(myarrayAB)

Even though the correlation within either group was set to zero, the correlation for the
combined groups is around .5 and highly significant. This is the phenomenon of spurious
correlation.
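To check the significance for yourself, you can use cor.test, a base-R function that is not part of the script above, on the two columns of the combined array:

```r
# Test the U-V correlation in the combined sample
cor.test(myarrayAB[, "U"], myarrayAB[, "V"])
```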

To make it more concrete, consider if U and V were height and chest hairiness and groups A
and B were males and females. Since men tend to be taller and hairier than women, you
could find a spurious correlation between height and hairiness in a combined group, even
though they are uncorrelated within either sex.

One reason I like simulations is that they can give you new insights into such phenomena.
Note that we specified massive mean differences between our groups: one group with a
mean z-score of +1 and the other with mean z-score of -1. When I first attempted this
simulation, I used much smaller group differences, and was surprised at how hard it was to
generate a spurious correlation. With a simulation like this, you can play around and get a
good feel for the phenomenon by repeatedly generating datasets with different values. The
phenomenon of spurious correlation is a source of major concern, especially for those
interested in correlational data, but my impression is that its importance may have been
overemphasised, because in practice it doesn't become a problem except in quite extreme
situations where you have two groups with very different mean values.
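You can explore this systematically. The following sketch is my own addition, with illustrative values: it loops over a range of group separations and prints the combined-sample correlation each time, so you can see that the spurious correlation only becomes substantial when the group means are far apart.

```r
require(MASS)
set.seed(1)
mysigma0 <- diag(2)  # identity matrix: uncorrelated variables with unit variance
for (d in c(0, 0.25, 0.5, 1, 2)) {
  grpA <- mvrnorm(n = 1000, c(-d, -d), mysigma0)  # group A centred at -d
  grpB <- mvrnorm(n = 1000, c(d, d), mysigma0)    # group B centred at +d
  combined <- rbind(grpA, grpB)
  cat("separation:", 2 * d, " combined r:", round(cor(combined)[1, 2], 2), "\n")
}
```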

5. Demonstrating how incorporating group identity in a linear model unmasks the
spurious nature of the correlation between U and V

Let us stick with the interpretation of our simulated data as representing height and hairiness
in males and females (ignoring the fact that the group mean differences are vastly greater
than would be realistic). We now need to add to our combined dataset another column that
specifies gender.

The R command rep will just create a vector of repeated numbers. We make a set of 60
values = .5 for males, and 60 values = -.5 for females. The reason for picking these specific
values is because it helps interpretation of regression output if we set the average for two
groups to zero and make the mean difference between them equal to one. However, it's not




essential to do this, and you could have picked other numbers, such as 0 and 1 to indicate
group identity.

males=rep(.5,myn) #Create vector with myn repetitions of value .5
females=rep(-.5,myn) #Create vector with myn repetitions of value -.5

Having made our two sets of numbers, we then join them together in a variable called
gender as follows:
gender=c(males, females)

Run these commands and then type gender at the console to check the result.

All that is now needed is to bolt this column on to our existing myarrayAB, which we can do
with a single command for concatenating columns, cbind.
myarrayAB=cbind(gender,myarrayAB)

Note that I have created a lot of intermediate variables in the course of generating
myarrayAB. This is unnecessary and uses up memory. It would be possible to combine
several steps in one command and so avoid creating the intermediate variables. However,
when learning R, I think it is helpful to break commands down into small steps and create new
variables, as this allows you to see the logic of what is being done, and to check the values of
each variable. It also makes your scripts easier to understand when you come back to them
later. Very experienced programmers may write much more compact code than this, but with
modern computers, memory is seldom a problem unless working with very large data arrays,
and so, apart from demonstrating how clever you are, compact code doesn't serve much
function.

We now want to do a regression analysis. We will start with simple regression of V on U for
the combined group data.
R has many powerful commands for doing regression, but it requires that the data are
formatted in what is called a data frame.
Fortunately, this transformation is trivially easy: we just add the command:
mydata=data.frame(myarrayAB)
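The str command (base R) gives a quick way of confirming that the conversion worked as intended:

```r
mydata <- data.frame(myarrayAB)
str(mydata)  # gender, U and V should now appear as numeric columns
```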

Commands for regression in R are formulated in terms of the general linear model. This is a
very general and flexible approach to statistical analysis that readily incorporates the more
traditional methods beloved of psychologists such as analysis of variance. However, I suspect
that many psychologists reading this won't find it a very intuitive way to think about data, and
it takes a while to map the R commands onto pre-existing statistical knowledge.

The other thing that can be puzzling is that with programs such as SPSS, we are used to
running a command and then looking at the output screen. Although R can be used in an
analogous way, it is more usual to write the results to another variable. The variable that
holds the results is likely to be a fairly complex structure, as we shall see. But the basic idea
is that you don't just use a command to do the analysis: you actually specify a name for the
output of the analysis.

The simplest form of regression is pretty easy. The command lm just stands for linear model,
and requires two obligatory commands: you have to specify a formula that indicates the
relationship between predicted and predictor variables, and specify the dataset used to
estimate regression coefficients. So let's illustrate this with our U and V variables.
Add this command to the script:

myreg1=lm(V~U,mydata)

and then inspect the myreg1 variable that is created.
This contains two coefficients, an intercept, that is close to zero, and a slope, that is close to
0.5.
Note that when you type myreg1 you also get information about the formula used to generate
the coefficients, labelled call. The output of lm contains a complex set of varied information in a


structure. If you want to look at just part of the structure, you have to use the $ sign to indicate
which bit. Try this, by just typing at the console:
myreg1$call
and
myreg1$coef

You will see that the portion after the $ indicates which bit of the myreg1 structure is referred
to.

The term V~U tells the program to fit a straight line according to the formula:
V = b1 + b2.U
where b1 is the intercept and b2 is the slope.

It is these slopes and intercepts that are then generated when the lm command is executed.

We can use these outputs to plot the regression line.
First plot the raw data. This command will achieve that:
plot(V~U, mydata)

The command abline plots a straight line with a given intercept and slope. You could add a
straight line with intercept zero and slope 1, as follows:
abline(0,1)

The regression line is simply the straight line with intercept and slope corresponding to the
computed regression coefficients, and so can be plotted just by typing:
abline(myreg1$coef)

The lty command allows you to specify the type of line you want. This command will make the
regression line a dotted line.
abline(myreg1$coef,lty=5)

As an aside here, I haven't used R very much, and when I first saw a command with lty I was
confused and thought it was some kind of variable. This is, in my experience, a common
difficulty with R. Various letter sequences that look like variables or functions, aren't. What did
I do? I Googled "R lty" and immediately all became clear. Perhaps the single most important
advice if you want to learn R is to just use Google if you get stuck.

We now want to look at the regression with gender included. A simple modification to the
syntax achieves this. We have taken care to code gender so that the sum of the two gender
codes is zero, and we can include it in the linear model, even though it is a categorical
variable.
Here is the command:
myreg2=lm(V~U+gender,mydata)

This corresponds to the regression equation:
V = b1 + b2.U + b3.gender

If we type myreg2, we see that the output now has one intercept and two regression
coefficients, like this:
(Intercept)             U      gender
0.03357             0.04529   1.68913

Your values may differ from this because the simulated data will be different, but the overall
pattern will be similar. Note that the regression coefficient associated with U is now close to
zero, whereas that associated with gender is much bigger.
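As an aside, you don't have to read the coefficients off the screen: coef and confint will extract them within a script. Here is a minimal stand-alone sketch, using simulated stand-in data of the same form as the exercise (not the actual mydata, and myfit is just a name made up for this example):

```r
# Stand-in data of the same form as the gender regression above
set.seed(1)
U = rnorm(60)
gender = rep(c(-1,1), 30)             # balanced gender codes summing to zero
V = 0.05*U + 1.7*gender + rnorm(60)   # V driven mainly by gender, not U
myfit = lm(V ~ U + gender)
b = coef(myfit)   # named vector: (Intercept), U, gender
b["gender"]       # pick out a single coefficient by name
confint(myfit)    # 95% confidence intervals for all coefficients
```

Extracting coefficients by name rather than by position makes a script robust if you later reorder the predictors.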

Once we have run the model we can get much more detailed statistical output by requesting a
summary, as follows:
summary(myreg2)




Version 1.1, 9th June 2012
Now we have not only the coefficients, but their standard errors, associated t-values and
significance levels. This confirms that gender is a substantial predictor of V, and U is not.
Finally, you can use the anova command to produce an anova table comparing the fit of the
two models:
anova(myreg1,myreg2)

I've learned a lot about using R for regression analysis from this site. It also has information
on how to do diagnostic plots, for instance. However, for the present, I won't get diverted into
that, but will rather press on to look at what happens if group membership is defined on a
variable that is highly correlated with one of the measures in the analysis.
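For the curious, though, R's built-in diagnostic plots need only one command on a fitted lm object. A stand-alone sketch with made-up data (x, y and myfit are names invented for this example):

```r
# Hypothetical mini-example: diagnostic plots for any fitted lm object
set.seed(1)
x = rnorm(40)
y = 2*x + rnorm(40)
myfit = lm(y ~ x)
par(mfrow=c(2,2))  # arrange the four plots in a 2 x 2 grid
plot(myfit)        # residuals, normal Q-Q, scale-location, leverage
par(mfrow=c(1,1))  # restore the default single-plot layout
```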


6. Demonstrating how removing the effect of group will be misleading if group identity
is highly dependent on one of the variables.

You should by now be able to follow this script, which is heavily commented to explain each
step. This time we are going to generate a multivariate normal distribution with three
variables. Two of them, L1 and L2, are language measures, and A is an auditory measure. The
language measures show moderate correlation with the auditory measure and are highly
intercorrelated with one another. Group identity (control or language impaired, coded 1 or -1)
is defined by whether or not the score on L1 is above a z-score of -1. This, then, is analogous
to the case of dyslexia or language impairment, where we define whether or not the child has
the diagnosis on the basis of a low test score.

In a case like this, removing the effect of group can abolish the relationship between L2 and
A, simply because L1 and L2 are highly intercorrelated. It would be quite wrong to conclude
from this that L2 and A are not related.

#demo_spurious_corr_script3
# Using a group variable that is highly correlated with one variable

# With these settings, including the SLI group variable in the model
# removes the apparent influence of L2
require(MASS) #Load functions from Modern Applied Statistics for S
mylabels=c('L1','L2','A') #3 variables, two language and one auditory
myr=.8 #correlation between the language measures
myr2=.3 #correlation of both language measures with auditory
mysigma = matrix(c(1,myr,myr2,
            myr,1,myr2,
            myr2,myr2,1),3,3)
myn=60
set.seed(6) #change or comment out this line to get different set of estimates

mymean=c(0,0,0) #Means for L1, L2, and A are zero
myarray3=mvrnorm(n=myn, mymean,mysigma,empirical=TRUE)
colnames(myarray3)=mylabels
summary(myarray3)
cor(myarray3)
myL1=myarray3[,1] #first column

# Now determine which cases are control or SLI and put in mygroup variable
mygroup=rep.int(-1,myn) #default is SLI, coded -1
mycon=which(myL1> -1) #row index of those with L1 in con range
mygroup[mycon]=1 # These rows are assigned group code of 1 (control)

myarrayAB=cbind(mygroup,myarray3) #add mygroup to the data array
mydata=data.frame(myarrayAB)

#Regression with only Group included
myreg1=lm(A~mygroup,mydata)


summary(myreg1)

#Regression with both group and L2 included
myreg2=lm(A~L2+mygroup,mydata)
summary(myreg2)

anova(myreg1,myreg2)

#Regression if we exclude group ID
myreg3=lm(A~L2,mydata)
summary(myreg3)
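As a quick check on what the simulation produced, the following stand-alone sketch regenerates the same data and confirms two things: because we set empirical=TRUE, the L2-A correlation is exactly the .3 we specified, and group identity is strongly tied to L1 (m and grp are names made up for this sketch):

```r
# Stand-alone check: regenerate the script's data and inspect correlations
require(MASS)
mysigma = matrix(c(1,.8,.3,
                   .8,1,.3,
                   .3,.3,1),3,3)
set.seed(6)
m = mvrnorm(n=60, mu=c(0,0,0), Sigma=mysigma, empirical=TRUE)
cor(m[,2], m[,3])               # L2 with A: exactly .3, by construction
grp = ifelse(m[,1] > -1, 1, -1) # group defined by the cutoff on L1
cor(m[,1], grp)                 # group identity is strongly tied to L1
```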

The point I want to make with this simulation is that if we want to 'take out' the effect of group
identity from a correlation, then we need to think carefully about the logic of what we are
doing.

In the previous example of spurious correlation, we defined gender quite independently of our
two measures, height and hairiness. Although males and females differed substantially on
both measures, their gender was not determined by those measures. In any logical causal
route, we can confidently treat gender as a primary cause, and so it makes sense to 'take out'
its effect.

For certain developmental disorders (and indeed other conditions), the causal route is much
less certain, because the disorder is diagnosed on the basis of measured variables. So, for
instance, dyslexia is defined in terms of low scores on reading measures. In the simulation
above, we looked at correlation between L2 and A, and defined our disorder in terms of L1 -
which was highly correlated with L2. We could have defined dyslexia in terms of L2 - you
might like to try that: it will achieve a similar effect. The results we got from our simulation are
actually sensible, but there is a danger they will be misinterpreted. What they are actually
telling us is that language measures and auditory measures are significantly correlated, and
this is evident regardless of whether we use a categorical language measure, where group
identity is determined by cutoff on a test, or a quantitative measure. What this analysis is
definitely not saying is that the correlation between language and auditory measures is
spurious.
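The variation suggested above can be sketched as follows. This stand-alone version regenerates the same data and then applies the cutoff to L2 rather than L1 (mygroupB and mydataB are names invented for the variant):

```r
# Variation to try: define the disorder by a cutoff on L2 instead of L1
require(MASS)
mysigma = matrix(c(1,.8,.3,
                   .8,1,.3,
                   .3,.3,1),3,3)
set.seed(6)
myarray3 = mvrnorm(n=60, mu=c(0,0,0), Sigma=mysigma, empirical=TRUE)
colnames(myarray3) = c('L1','L2','A')
mygroupB = rep.int(-1, 60)             # default is impaired, coded -1
mygroupB[which(myarray3[,2] > -1)] = 1 # controls: L2 above z-score of -1
mydataB = data.frame(cbind(mygroupB, myarray3))
summary(lm(A~L2+mygroupB, mydataB))  # compare the L2 coefficient with myreg2
```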

It's possible to imagine a situation where you could have a spurious association with these
kinds of variables. For instance, poor social environment may affect both language measures
and auditory measures. To show that, we'd need to incorporate a measure of social
environment in our regression analysis. But the bottom line is that if we want to argue that an
association between variables X and Y is spurious, we must have a third variable, Z, that is
(a) measurable and (b) not dependent on X or Y. Z may be highly correlated with X and Y:
that's not a problem. The problem is when Z is determined by X or Y.
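To make that concrete, here is a hypothetical sketch in which a made-up third variable Z (standing for social environment) drives both a language measure and an auditory measure; the variable names and effect sizes are invented for illustration:

```r
# Hypothetical sketch: a genuine third variable Z (social environment)
# that drives both a language measure L and an auditory measure A
set.seed(1)
n = 60
Z = rnorm(n)                  # social environment, measured independently
L = 0.6*Z + rnorm(n, sd=0.8)  # language score partly driven by Z
A = 0.6*Z + rnorm(n, sd=0.8)  # auditory score partly driven by Z
summary(lm(A ~ L))            # L appears to predict A via their shared cause
summary(lm(A ~ L + Z))        # with Z included, the L coefficient typically shrinks
```

The key design point is that Z is generated independently of L and A's error terms, so it is legitimate to partial it out; this is exactly what fails when the "third variable" is itself defined by a cutoff on one of the measures.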





More Related Content

What's hot

MATLAB programming tips 2 - Input and Output Commands
MATLAB programming tips 2 - Input and Output CommandsMATLAB programming tips 2 - Input and Output Commands
MATLAB programming tips 2 - Input and Output CommandsShameer Ahmed Koya
 
[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++
[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++
[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++Muhammad Hammad Waseem
 
Error correction-and-type-of-error-in-c
Error correction-and-type-of-error-in-cError correction-and-type-of-error-in-c
Error correction-and-type-of-error-in-cMd Nazmul Hossain Mir
 
Chowtodoprogram solutions
Chowtodoprogram solutionsChowtodoprogram solutions
Chowtodoprogram solutionsMusa Gürbüz
 
Visual basic asp.net programming introduction
Visual basic asp.net programming introductionVisual basic asp.net programming introduction
Visual basic asp.net programming introductionHock Leng PUAH
 
Types of Statements in Python Programming Language
Types of Statements in Python Programming LanguageTypes of Statements in Python Programming Language
Types of Statements in Python Programming LanguageExplore Skilled
 
Basic c# cheat sheet
Basic c# cheat sheetBasic c# cheat sheet
Basic c# cheat sheetAhmed Elshal
 

What's hot (18)

MATLAB programming tips 2 - Input and Output Commands
MATLAB programming tips 2 - Input and Output CommandsMATLAB programming tips 2 - Input and Output Commands
MATLAB programming tips 2 - Input and Output Commands
 
Python Control structures
Python Control structuresPython Control structures
Python Control structures
 
[ITP - Lecture 12] Functions in C/C++
[ITP - Lecture 12] Functions in C/C++[ITP - Lecture 12] Functions in C/C++
[ITP - Lecture 12] Functions in C/C++
 
[ITP - Lecture 07] Comments in C/C++
[ITP - Lecture 07] Comments in C/C++[ITP - Lecture 07] Comments in C/C++
[ITP - Lecture 07] Comments in C/C++
 
A01
A01A01
A01
 
[ITP - Lecture 11] Loops in C/C++
[ITP - Lecture 11] Loops in C/C++[ITP - Lecture 11] Loops in C/C++
[ITP - Lecture 11] Loops in C/C++
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
 
[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++
[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++
[ITP - Lecture 10] Switch Statement, Break and Continue Statement in C/C++
 
Error correction-and-type-of-error-in-c
Error correction-and-type-of-error-in-cError correction-and-type-of-error-in-c
Error correction-and-type-of-error-in-c
 
Chowtodoprogram solutions
Chowtodoprogram solutionsChowtodoprogram solutions
Chowtodoprogram solutions
 
[ITP - Lecture 15] Arrays & its Types
[ITP - Lecture 15] Arrays & its Types[ITP - Lecture 15] Arrays & its Types
[ITP - Lecture 15] Arrays & its Types
 
Visual basic asp.net programming introduction
Visual basic asp.net programming introductionVisual basic asp.net programming introduction
Visual basic asp.net programming introduction
 
Types of Statements in Python Programming Language
Types of Statements in Python Programming LanguageTypes of Statements in Python Programming Language
Types of Statements in Python Programming Language
 
[ITP - Lecture 14] Recursion
[ITP - Lecture 14] Recursion[ITP - Lecture 14] Recursion
[ITP - Lecture 14] Recursion
 
C++ lecture 01
C++   lecture 01C++   lecture 01
C++ lecture 01
 
Manual pseint
Manual pseintManual pseint
Manual pseint
 
Basic c# cheat sheet
Basic c# cheat sheetBasic c# cheat sheet
Basic c# cheat sheet
 
Cp module 2
Cp module 2Cp module 2
Cp module 2
 

Similar to Learning R while exploring illusory correlations

Similar to Learning R while exploring illusory correlations (20)

R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
 
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
 
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
 
Tutorial basic of c ++lesson 1 eng ver
Tutorial basic of c ++lesson 1 eng verTutorial basic of c ++lesson 1 eng ver
Tutorial basic of c ++lesson 1 eng ver
 
C programming perso notes
C programming perso notesC programming perso notes
C programming perso notes
 
Logo tutorial
Logo tutorialLogo tutorial
Logo tutorial
 
5. R basics
5. R basics5. R basics
5. R basics
 
Lecture1
Lecture1Lecture1
Lecture1
 
FULL R PROGRAMMING METERIAL_2.pdf
FULL R PROGRAMMING METERIAL_2.pdfFULL R PROGRAMMING METERIAL_2.pdf
FULL R PROGRAMMING METERIAL_2.pdf
 
Introduction to Programming and QBasic Tutorial
Introduction to Programming and QBasic TutorialIntroduction to Programming and QBasic Tutorial
Introduction to Programming and QBasic Tutorial
 
Lab4 scripts
Lab4 scriptsLab4 scripts
Lab4 scripts
 
JavaScript: Core Part
JavaScript: Core PartJavaScript: Core Part
JavaScript: Core Part
 
Notes1
Notes1Notes1
Notes1
 
Programming For As Comp
Programming For As CompProgramming For As Comp
Programming For As Comp
 
Programming For As Comp
Programming For As CompProgramming For As Comp
Programming For As Comp
 
Questions4
Questions4Questions4
Questions4
 
Getting started with R
Getting started with RGetting started with R
Getting started with R
 
3.5
3.53.5
3.5
 
Algorithm and c language
Algorithm and c languageAlgorithm and c language
Algorithm and c language
 
Loops_in_Rv1.2b
Loops_in_Rv1.2bLoops_in_Rv1.2b
Loops_in_Rv1.2b
 

More from Dorothy Bishop

Exercise/fish oil intervention for dyslexia
Exercise/fish oil intervention for dyslexiaExercise/fish oil intervention for dyslexia
Exercise/fish oil intervention for dyslexiaDorothy Bishop
 
Open Research Practices in the Age of a Papermill Pandemic
Open Research Practices in the Age of a Papermill PandemicOpen Research Practices in the Age of a Papermill Pandemic
Open Research Practices in the Age of a Papermill PandemicDorothy Bishop
 
Language-impaired preschoolers: A follow-up into adolescence.
Language-impaired preschoolers: A follow-up into adolescence.Language-impaired preschoolers: A follow-up into adolescence.
Language-impaired preschoolers: A follow-up into adolescence.Dorothy Bishop
 
Journal club summary: Open Science save lives
Journal club summary: Open Science save livesJournal club summary: Open Science save lives
Journal club summary: Open Science save livesDorothy Bishop
 
Short talk on 2 cognitive biases and reproducibility
Short talk on 2 cognitive biases and reproducibilityShort talk on 2 cognitive biases and reproducibility
Short talk on 2 cognitive biases and reproducibilityDorothy Bishop
 
Otitis media with effusion: an illustration of ascertainment bias
Otitis media with effusion: an illustration of ascertainment biasOtitis media with effusion: an illustration of ascertainment bias
Otitis media with effusion: an illustration of ascertainment biasDorothy Bishop
 
Insights from psychology on lack of reproducibility
Insights from psychology on lack of reproducibilityInsights from psychology on lack of reproducibility
Insights from psychology on lack of reproducibilityDorothy Bishop
 
What are metrics good for? Reflections on REF and TEF
What are metrics good for? Reflections on REF and TEFWhat are metrics good for? Reflections on REF and TEF
What are metrics good for? Reflections on REF and TEFDorothy Bishop
 
Biomarkers for psychological phenotypes?
Biomarkers for psychological phenotypes?Biomarkers for psychological phenotypes?
Biomarkers for psychological phenotypes?Dorothy Bishop
 
Data simulation basics
Data simulation basicsData simulation basics
Data simulation basicsDorothy Bishop
 
Simulating data to gain insights into power and p-hacking
Simulating data to gain insights intopower and p-hackingSimulating data to gain insights intopower and p-hacking
Simulating data to gain insights into power and p-hackingDorothy Bishop
 
Talk on reproducibility in EEG research
Talk on reproducibility in EEG researchTalk on reproducibility in EEG research
Talk on reproducibility in EEG researchDorothy Bishop
 
What is Developmental Language Disorder
What is Developmental Language DisorderWhat is Developmental Language Disorder
What is Developmental Language DisorderDorothy Bishop
 
Developmental language disorder and auditory processing disorder: 
Same or di...
Developmental language disorder and auditory processing disorder: 
Same or di...Developmental language disorder and auditory processing disorder: 
Same or di...
Developmental language disorder and auditory processing disorder: 
Same or di...Dorothy Bishop
 
Fallibility in science: Responsible ways to handle mistakes
Fallibility in science: Responsible ways to handle mistakesFallibility in science: Responsible ways to handle mistakes
Fallibility in science: Responsible ways to handle mistakesDorothy Bishop
 
Improve your study with pre-registration
Improve your study with pre-registrationImprove your study with pre-registration
Improve your study with pre-registrationDorothy Bishop
 
Introduction to simulating data to improve your research
Introduction to simulating data to improve your researchIntroduction to simulating data to improve your research
Introduction to simulating data to improve your researchDorothy Bishop
 
Southampton: lecture on TEF
Southampton: lecture on TEFSouthampton: lecture on TEF
Southampton: lecture on TEFDorothy Bishop
 
Reading list: What’s wrong with our universities
Reading list: What’s wrong with our universitiesReading list: What’s wrong with our universities
Reading list: What’s wrong with our universitiesDorothy Bishop
 
IJLCD Winter Lecture 2016-7 : References
IJLCD Winter Lecture 2016-7 : ReferencesIJLCD Winter Lecture 2016-7 : References
IJLCD Winter Lecture 2016-7 : ReferencesDorothy Bishop
 

More from Dorothy Bishop (20)

Exercise/fish oil intervention for dyslexia
Exercise/fish oil intervention for dyslexiaExercise/fish oil intervention for dyslexia
Exercise/fish oil intervention for dyslexia
 
Open Research Practices in the Age of a Papermill Pandemic
Open Research Practices in the Age of a Papermill PandemicOpen Research Practices in the Age of a Papermill Pandemic
Open Research Practices in the Age of a Papermill Pandemic
 
Language-impaired preschoolers: A follow-up into adolescence.
Language-impaired preschoolers: A follow-up into adolescence.Language-impaired preschoolers: A follow-up into adolescence.
Language-impaired preschoolers: A follow-up into adolescence.
 
Journal club summary: Open Science save lives
Journal club summary: Open Science save livesJournal club summary: Open Science save lives
Journal club summary: Open Science save lives
 
Short talk on 2 cognitive biases and reproducibility
Short talk on 2 cognitive biases and reproducibilityShort talk on 2 cognitive biases and reproducibility
Short talk on 2 cognitive biases and reproducibility
 
Otitis media with effusion: an illustration of ascertainment bias
Otitis media with effusion: an illustration of ascertainment biasOtitis media with effusion: an illustration of ascertainment bias
Otitis media with effusion: an illustration of ascertainment bias
 
Insights from psychology on lack of reproducibility
Insights from psychology on lack of reproducibilityInsights from psychology on lack of reproducibility
Insights from psychology on lack of reproducibility
 
What are metrics good for? Reflections on REF and TEF
What are metrics good for? Reflections on REF and TEFWhat are metrics good for? Reflections on REF and TEF
What are metrics good for? Reflections on REF and TEF
 
Biomarkers for psychological phenotypes?
Biomarkers for psychological phenotypes?Biomarkers for psychological phenotypes?
Biomarkers for psychological phenotypes?
 
Data simulation basics
Data simulation basicsData simulation basics
Data simulation basics
 
Simulating data to gain insights into power and p-hacking
Simulating data to gain insights intopower and p-hackingSimulating data to gain insights intopower and p-hacking
Simulating data to gain insights into power and p-hacking
 
Talk on reproducibility in EEG research
Talk on reproducibility in EEG researchTalk on reproducibility in EEG research
Talk on reproducibility in EEG research
 
What is Developmental Language Disorder
What is Developmental Language DisorderWhat is Developmental Language Disorder
What is Developmental Language Disorder
 
Developmental language disorder and auditory processing disorder: 
Same or di...
Developmental language disorder and auditory processing disorder: 
Same or di...Developmental language disorder and auditory processing disorder: 
Same or di...
Developmental language disorder and auditory processing disorder: 
Same or di...
 
Fallibility in science: Responsible ways to handle mistakes
Fallibility in science: Responsible ways to handle mistakesFallibility in science: Responsible ways to handle mistakes
Fallibility in science: Responsible ways to handle mistakes
 
Improve your study with pre-registration
Improve your study with pre-registrationImprove your study with pre-registration
Improve your study with pre-registration
 
Introduction to simulating data to improve your research
Introduction to simulating data to improve your researchIntroduction to simulating data to improve your research
Introduction to simulating data to improve your research
 
Southampton: lecture on TEF
Southampton: lecture on TEFSouthampton: lecture on TEF
Southampton: lecture on TEF
 
Reading list: What’s wrong with our universities
Reading list: What’s wrong with our universitiesReading list: What’s wrong with our universities
Reading list: What’s wrong with our universities
 
IJLCD Winter Lecture 2016-7 : References
IJLCD Winter Lecture 2016-7 : ReferencesIJLCD Winter Lecture 2016-7 : References
IJLCD Winter Lecture 2016-7 : References
 

Recently uploaded

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Recently uploaded (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Learning R while exploring illusory correlations

  • 1. Learning R while exploring statistics This exercise is designed to help you learn R while at the same time gaining insights into the phenomenon of illusory correlation. We will go through the following steps: 1. Downloading R and R Studio, an interface to the R programming language that is rather easier to work with than the basic interface. 2. Familiarisation with basic operations in R 3. Generating simulated data: two correlated variables, X and Y 4. Generating simulated data from two groups with different means on uncorrelated variables U and V to demonstrate spurious correlation between U and V. 5. Demonstrating how incorporating group identity in a linear model unmasks the spurious nature of the correlation between U and V. 6. Demonstrating how removing the effect of group will be misleading if group identity is highly dependent on one of the variables. These instructions apply to those working on PC, and I don't know whether equivalent on Mac. For steps 5 and 6 it's assumed you have a basic understanding of simple regression. 1. Downloading R and R Studio Downloading R R is a powerful language for statistical computing, but much of the documentation is written for experts, and so it can be daunting for beginners. If you go to the website: http://www.r-project.org/ You will see instructions for how to download R. Do not be put off by the instruction to "choose your preferred CRAN mirror": this just means you should select a download site from the list provided that is geographically close to where you are. You may then be offered further options that you may not fully understand. Just persevere by selecting the 'windows' option from the "Download and install R" section, and then select 'base', which at last takes you to a page with straightforward download instructions. Installation of R will create a Start Menu item and an icon for R on your desktop. Downloading R Studio To download R studio, go to this website and follow the instructions. 
http://rstudio.org/ If for any reason you prefer not to use R Studio, the examples should all work from the original R interface, but your screen may look different, and it may be difficult to arrange items such as figures in a sensible way. 2. Familiarisation with basic operations in R After opening R Studio your screen will be divided into several windows. Move your cursor to the window called R console, in which you can type commands. You will see a > cursor. This cursor will not be shown in the examples below, but it indicates that the console is awaiting input from you. At the > cursor, type: help.start() As with other programming languages, you hit Enter at the end of each command. This will open a window showing links to various manuals. You may want to briefly explore this before going further. Just to familiarise yourself with the console, type: 1+2 R evaluates the expression and you see output: [1] 3 th Version 1.1 9 June 2012 1
The [1] at the beginning of the output line indicates that the answer is the first row of the variable. This looks confusing if you just have a single number, as in this case, but, as we will see, output can consist of an array of numbers.

Now type:

x = 1+2

Nothing happens. But the variable x has been assigned, and if you now type x on the console, you will again see the output:

[1] 3

In R, the results of variable assignments are not shown automatically, but you can see them at any time by just typing the name of the variable. You can also see all current variables in the Workspace screen on the right. The value assigned to variable x will remain assigned unless you explicitly remove it using the 'rm' command. Type:

rm(x)

You now see that x has disappeared from the Workspace. If you type x again, the console gives the message:

Error: object 'x' not found

You can repeat an earlier command by pressing the up arrow until it reappears. Use this method to redo the assignment x=1+2, and then type X. Again you get the error message, because R is case-sensitive, and so X and x are different variables.

Now type:

y = c(1, 3, 6, 7)

The Workspace tells you y is a numeric variable with four values, i.e. a vector. To see the values, type y on the console. You will see the vector of numbers [1 3 6 7]. The 'c' in the previous command is not a variable name, but rather denotes the operation of concatenation. It just instructs R to create a variable consisting of the sequence of material that follows in brackets.

Now type:

x=

and hit Enter. The cursor changes to +. This is R telling you that the command is incomplete. If you now type 1+2 followed by Enter, your regular cursor returns, because the command is completed. It can happen that you start typing a command and think better of it. To escape from an incomplete command, and restore the > cursor, just hit Escape.
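In the spirit of playing around, here is a short sketch of extra things to try at the console with a vector like the y created above. These lines are my own additions, not part of the tutorial's sequence, but they show how R treats whole vectors at once:

```r
y = c(1, 3, 6, 7)  # the vector created earlier
y[2]               # square brackets pick out elements: this gives the 2nd value, 3
length(y)          # the number of elements: 4
y*2                # arithmetic works element-wise: 2 6 12 14
sum(y)             # built-in functions operate on whole vectors: 17
```

Trying variants such as y[c(1,3)] or y > 4 will show how indexing and comparisons also apply to whole vectors.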
The Console is useful for doing quick computations and checking out commands, but in general, when you do computations, you will want to use a script, i.e. a set of commands that you can save, so you can repeat the sequence of operations at any time. The script is written in the Source window (also known as the Editor window).

From the menu at the top of the screen select File|New|R script. You will see a new tab in the Source window, labelled Untitled1. You want to save it with a name. Select a name such as Demo1 and type this in the Source window, preceded by the symbol #. It is important that the name contains no blank spaces. If you make a script name with blank spaces, this can create havoc later on, because when you try to execute it, R will interpret all but the first word as commands, and you will get misleading error messages that will have you scratching your head as to what they mean.
The hash symbol that you typed before the script name is used to create a comment in a script, i.e. a line that is used to remind the user of important information, but which is not executed when the script runs. It is customary to put the title of the script, plus information about its function, author and date, at the head of the script. Select the menu command File|Save As to save the script with that name.

Currently, your script doesn't do anything. Let's give it some content. In the Source window type:

x=2+3
y=4+5
z=x+y

Now select the top menu item Edit|Run Code|Run All. As the script executes, you will see the commands in the script repeated in the Console window, and the values of the variables x, y and z in the Workspace window. These variables will remain assigned to these values until explicitly cleared. You can test this by typing a command at the console such as:

x-y

which will give the answer -4.

Important: Traditionally, R scripts use <- instead of =. So you will see instances of scripts which have commands such as a <- 1+3. This is equivalent to a = 1+3. It is also possible to have the arrow going the other way, i.e. 1+3 -> a, which means the same thing. My view of life is that you should never make two keystrokes when one will do, and so I persist with the use of the equals sign, but R purists disapprove of this. One reason for avoiding = in assigning values to variables is that it can be confusing, because the equals symbol is also used in other contexts, such as judging whether two things are the same. For the present, I'm not going to worry you further about this, but you may want to squirrel that fact away. Confusion between different uses of the = operator causes much grief, not just in R but in most programming languages.

Loops: A loop is a way of repeatedly executing the same code. Suppose we wanted to print out the ten times table: we could type 1*10, 2*10, 3*10, and so on.
But a simpler method is to use a loop, where we multiply 10 times a variable, myx, and specify the range of values that myx will take at the start of the loop. Thus we can type in the commands:

for (myx in c(1:10)) {
print(10*myx)
}

The first line specifies the values that myx can take, i.e. c(1:10), which is the values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The program executes all the commands between curly brackets repeatedly, incrementing the value of myx each time it does so, until it gets to the final value, whereupon it exits the loop.

Stopping a program: Sometimes a program has been written in a way that it keeps running and never stops. If you need to abort, you just type Ctrl+C.

Commenting: A good script will contain many lines preceded by #. This indicates that the line is a comment – it does not contain commands to be executed, but provides explanation of how the script works.

Before you go any further, create a new directory that will contain all of your scripts, data, and workspace for a project. Then go to the menu and select Tools|Set Working Directory|Choose Directory and navigate to your new directory. This means that all your
work will be saved in one place. Whenever you start up R from a file in that directory, it will continue as your working directory.

A note on quotes: If you paste a script into your R console or browser, quotes may get reformatted, causing an error. Always check: for R, single quotes should be straight quotes, not 'smart quotes' (i.e. quotes that slant or curl in a different direction at the start and end of a quoted section). You may need to retype them if your system has reformatted them.

Further reading

The best way to learn R is to play with it. You should try typing in commands to see what happens. Use the R Manuals from the Help screen to get started. In addition, these texts are recommended:

Braun, W. J., & Murdoch, D. J. (2007). A first course in statistical programming with R. Cambridge: Cambridge University Press.
Crawley, M. J. (2007). The R book. Chichester, UK: Wiley.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer. (Do not be put off by the title: it really should be entitled 'with S and R'.)

3. Generating simulated data: two correlated variables, X and Y

An important aspect of R is the ease with which you can generate simulated data. Playing with simulated data is one of the best ways of gaining an intuitive grasp of statistics. You can create a dataset with certain characteristics, and then see what happens when you analyse it in different ways. Most introductions to statistics ignore the potential of simulated data, and simulation is often seen as an advanced topic. My view is that it should be one of the first things you learn to do.

As a first exercise in running a script in R, we shall generate a simulated set of data for two variables, look at some basic statistics for the variables, plot them, and save the data. We will be using the data for more interesting purposes later on, but for the time being, the aim is to familiarise you with some key R commands.
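As a minimal taster of what simulation looks like (my own sketch, separate from the main walkthrough below), base R's rnorm function draws random numbers from a normal distribution with a specified mean and SD:

```r
set.seed(1)                          # fix the random seed so the result is reproducible
myscores = rnorm(1000, mean = 100, sd = 15)  # 1000 draws from a normal distribution
mean(myscores)                       # close to 100, but not exact, due to sampling error
sd(myscores)                         # close to 15, for the same reason
hist(myscores)                       # a quick histogram of the simulated sample
```

Even this one-liner simulation illustrates a theme that recurs below: a sample's statistics wobble around the population values you specified.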
In addition, it is very useful to know how to simulate datasets with specific characteristics, as these can be used to check how various analyses work. Unfortunately, most people do find R commands quite daunting, and the command needed to create simulated data will probably look horrific if you are a newbie. Also, in R the help commands are often not all that helpful, as they are written for statisticians. Don't lose your nerve. I shall walk you through it and all will become clear.

One of the first things you need to understand about R is that there is a huge number of functions that you can use to carry out various statistical, mathematical and graphic operations, but they aren't all available when you start up R. Many of them are available in 'packages' which you have to specify if you want to use them. There's a nice explanation of how you can find and use packages here: http://ww2.coastal.edu/kingw/statistics/R-tutorials/package.html. We're going to use commands from a package called MASS, which contains functions and datasets from 'Modern Applied Statistics with S' by Venables and Ripley (see Further reading above). All we need to do is to include the following line in our script:

require(MASS)

Once that command is executed, all the functions from MASS will be available for us to use. When learning R, it's a good idea to run each new command and see what, if anything, happens, and whether the workspace changes. If you just highlight one or more commands in the Editor window and then hit the Run button with a green arrow at the top of the window, this runs just those commands. If you run the 'require' command as above, the Console just reassuringly tells you it is loading MASS.
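If require ever complains that a package is not installed, it can be installed first from your chosen CRAN mirror. This is a small sketch of the usual pattern (MASS ships with standard R installations, so the install step will rarely be needed for this particular package):

```r
# Load MASS, installing it first if it is missing.
# require() returns TRUE if the package loaded, FALSE (with a warning) if not.
if (!require(MASS)) {
  install.packages("MASS")  # downloads from your chosen CRAN mirror
  library(MASS)             # then load it
}
```

The same pattern works for any package name you substitute for MASS.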
Now, we're going to generate two columns of correlated numbers, X and Y. We'll start by creating a variable to hold their names. The next line of your script should be:

mylabels=c('X','Y') # Put labels for the two variables in a vector

Remember: you could just omit the bit after the hash, which is a comment. It's there to remind you what you are doing. It may be obvious now, but, trust me, it won't be if you come back a week later. You should add your own comments, using language that will be helpful to you.

If you run this command you will see that the Workspace shows mylabels as a character variable with two values. R knows to treat mylabels as a character, rather than a numeric, variable because you have enclosed the labels in quotes. Does it matter if you use single or double quotes? I couldn't remember, so I just tried making a different variable by typing a command on the console with double quotes – you should do the same. It's always good practice to just play around with commands and see what happens.

We are going to use a fancy command from MASS called mvrnorm. It's not uncommon to forget the precise format that a command needs, but help is at hand. On the console type help(mvrnorm), and you will find that the Help screen shows you the way the command is used. It first tells you what the arguments are for the command, i.e. the things you need to specify to make it work; it then terrifies you with a more technical explanation, and finally gives a worked example. The worked example may be helpful or may just baffle you completely.

Let's look at mvrnorm. The help screen starts as follows:

mvrnorm(n = 1, mu, Sigma, tol = 1e-6, empirical = FALSE)

and then gives an account of what each argument is:

n: the number of samples required.
mu: a vector giving the means of the variables.
Sigma: a positive-definite symmetric matrix specifying the covariance matrix of the variables.
tol: tolerance (relative to largest variance) for numerical lack of positive-definiteness in Sigma.
empirical: logical. If true, mu and Sigma specify the empirical, not population, mean and covariance matrix.

Thus the first things you need to specify are the number of cases to simulate (n), the means of the variables, and the covariance matrix. We are going to be working with z-scores, to make life easier. Remember that for z-scores, a correlation is equivalent to a covariance, and the SD and variance are both equal to 1.

We first specify the correlation that we want:

myr = .5

Add that to your script, and run it, so that we have a value in myr. For Sigma, we need to specify the following matrix:

[1 myr
myr 1]

In R, you can create a matrix using the c (concatenate) command, but if you just typed c(1, myr, myr, 1), this wouldn't work. Why not? Try typing it at the console and see. You'll find you have the right numbers, but they aren't in a 2x2 matrix. To get them properly arranged, you need to explicitly specify that you want a matrix with two rows and two columns. So the full command is:

mysigma = matrix(c(1,myr,myr,1),2,2)
The last two numbers in the command indicate we want 2 rows and 2 columns. Look at mysigma. You could then try making another matrix, but with 1, 4 rather than 2, 2 at the end. I can't stress enough that to understand commands, you just have to try them out. If you aren't sure how something works, tweak a command and see what happens.

Note: there's nothing to stop you typing .5 rather than myr in the command above. It will give the same answer. But we want a flexible script that will allow us to play around and look at different values of the correlation, and if we use the variable myr in the code, rather than a specific value, this allows us to do that easily.

So all you now need is to specify the number of cases and the mean values for X and Y. We do this with the commands:

myn=50 # We're going to create 50 rows of data
mymean=c(0,0) # Means are zero for both X and Y

We are now ready to go! What about the other arguments, tol and empirical? They are optional and we'll leave them alone for the moment, though we will look at empirical later on. We need a variable name for our simulated data. Let's call it myarray. So we type:

myarray=mvrnorm(n=myn, mymean, mysigma)

Now run the whole script. Each command is reflected in the console as it executes. But where are the results? The Workspace now confirms that you have created myarray, which is a matrix with 50 rows and 2 columns. To look at the results, just type myarray on the console. There are your 50 paired z-scores!

Before going further, I'll just explain why I've created variables that all start with 'my'. This is not essential, but it's a fairly common method. It has the advantage that you are unlikely to inadvertently use a variable name that corresponds to an existing R command, and when reading a script it makes it generally easier to distinguish your variables from other parts of the R language. We have created paired variables, but they aren't yet labelled. Assigning column names to a matrix in R is easy.
Remember, we created mylabels earlier. We can assign these as our column names as follows:

colnames(myarray)=mylabels

So now you have built up a whole script to generate paired numbers, which looks like this:

# simulate_XY
# Script to simulate z-scores X and Y, with specific correlation
require(MASS) # Load functions from Modern Applied Statistics with S
mylabels=c('X','Y') # Labels to be used later for our variables
myr=.5 # Correlation (can be changed)
mysigma = matrix(c(1,myr,myr,1),2,2) # 2 x 2 covariance matrix
# (with z-scores, equivalent to correlation matrix)
myn=50 # N rows of data to simulate
mymean=c(0,0) # Means for each variable (zero for z-scores)
myarray=mvrnorm(n=myn, mymean, mysigma) # Create array of simulated data
colnames(myarray)=mylabels # Assign labels to columns of simulated data

But you may be suspicious. How do you know that the numbers you have generated have a mean of zero and are actually correlated .5? You can use R commands to find out. This command gives you a range of descriptive statistics, including the means:

summary(myarray)
and this one gives the correlation matrix:

cor(myarray)

At this point, you may start to think (depending on your locus of control) either that you have done something wrong, or that R is not very good. It's highly likely that your means will differ from zero, and the correlation will be smaller or bigger than .5. The reason is that we did not specify empirical = TRUE. R has faithfully generated a sample of observations from a population of values where the true correlation is .5, but because of sampling error, the observed value in this sample is likely to deviate from .5. If you re-run the program, but this time alter the mvrnorm command to:

myarray=mvrnorm(n=myn, mymean, mysigma, empirical=TRUE)

then you will find the means are zero (or, more likely, a real number that is infinitesimally small) and the correlation is .5. Alternatively, you could remove the empirical argument (or specify empirical=FALSE, which has the same effect), but specify n = 50000, or another very large number. The larger the sample you take from the population, the closer the sample correlation will be to the population correlation.

It's always a good idea to plot data as well as looking at summary statistics. To see a scatterplot of your data, add this command to your script:

plot(myarray)

A graph will now pop up in the Plots tab of the right-hand lower window.

Finally, you might want to save your simulated data so you can use them at a later time. This command will write a data file to your current directory:

write.table(myarray,"mysimdata")

If you want to get your data back on another occasion, this command will read the saved data into a matrix called newdata:

newdata=read.table("mysimdata")

The mvrnorm command uses a random number generator, which means that each time you run the script, different numbers will be generated. If you want to always get the same numbers, you can do so by just specifying a 'seed' for the random number generator.
This can be any number, but provided it is the same number each time, you'll get the same result. Just put this command somewhere before the mvrnorm command:

set.seed(2)

If you have started from scratch and got this far, then you should take a break and reward yourself with a cup of coffee or whatever other substances hit the spot for you.

4. Generating simulated data from two groups with different means on uncorrelated variables U and V

We're now going to apply what we've learned to generate data from two separate groups on two variables that are uncorrelated. The only difference is that the means differ on both variables for the two groups. Let's set the means for U and V for group A as -1; for group B they'll be 1. We'll generate 60 cases for each group. We'll call these datasets myarrayA and myarrayB. If you've followed what we've done so far, you should be able to work out how to do this. It will be a good exercise to try, as you learn R by thinking it through, rather than by just copying. But I'll give you a script to do it anyway, in case you get stuck:

#demo_spurious_corr_script
require(MASS) # Load functions from Modern Applied Statistics with S
mylabels=c('U','V')
myr=0 # U and V are uncorrelated, and so r is set to zero
mysigma = matrix(c(1,myr,myr,1),2,2)
myn=60
set.seed(3)
# Array for group A
mymean=c(-1,-1) # Mean z-score for group A
myarrayA=mvrnorm(n=myn, mymean, mysigma) # Generate uncorrelated U and V for group A
colnames(myarrayA)=mylabels
summary(myarrayA)
cor(myarrayA)
plot(myarrayA)
# Array for group B
mymean=c(1,1) # Mean z-score for group B
myarrayB=mvrnorm(n=myn, mymean, mysigma) # Generate uncorrelated U and V for group B
colnames(myarrayB)=mylabels
summary(myarrayB)
cor(myarrayB)
plot(myarrayB)

We now want to stack the two arrays, one above the other, into a combined array with a new name, myarrayAB. This can be achieved with a single command for concatenating rows, as follows:

myarrayAB=rbind(myarrayA,myarrayB)

We can then look at the correlation for the combined groups:

cor(myarrayAB)

Even though the correlation within either group was set to zero, the correlation for the combined groups is around .5 and highly significant. This is the phenomenon of spurious correlation. To make it more concrete, consider if U and V were height and chest hairiness, and groups A and B were males and females. Since men tend to be taller and hairier than women, you could find a spurious correlation between height and hairiness in a combined group, even though they are uncorrelated within either sex.

One reason I like simulations is that they can give you new insights into such phenomena. Note that we specified massive mean differences between our groups: one group with a mean z-score of +1 and the other with a mean z-score of -1. When I first attempted this simulation, I used much smaller group differences, and was surprised at how hard it was to generate a spurious correlation. With a simulation like this, you can play around and get a good feel for the phenomenon by repeatedly generating datasets with different values.
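This playing around can itself be scripted. The sketch below is my own extension of the script above, not part of the original: it loops over several group mean separations and prints the combined-sample correlation each time. Using empirical=TRUE inside each group makes the within-group correlation exactly zero, so the pattern is easy to see:

```r
require(MASS)   # for mvrnorm
set.seed(4)
mysigma = matrix(c(1, 0, 0, 1), 2, 2)   # U and V uncorrelated within groups
myn = 60
for (mydiff in c(0, .5, 1, 2)) {        # half the distance between the group means
  grpA = mvrnorm(myn, c(-mydiff, -mydiff), mysigma, empirical = TRUE)
  grpB = mvrnorm(myn, c( mydiff,  mydiff), mysigma, empirical = TRUE)
  combined = rbind(grpA, grpB)          # stack the two groups
  cat("mean difference", 2*mydiff, ": combined r =",
      round(cor(combined)[1, 2], 2), "\n")
}
```

With these settings the printed correlations rise from 0 through roughly .2 and .5 to about .8 as the mean separation grows, which matches the observation above that modest group differences produce surprisingly little spurious correlation.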
The phenomenon of spurious correlation is a source of major concern, especially for those interested in correlational data, but my impression is that its importance may have been overemphasised, because in practice it doesn't become a problem except in quite extreme situations where you have two groups with very different mean values.

5. Demonstrating how incorporating group identity in a linear model unmasks the spurious nature of the correlation between U and V

Let us stick with the interpretation of our simulated data as representing height and hairiness in males and females (ignoring the fact that the group mean differences are vastly greater than would be realistic). We now need to add to our combined dataset another column that specifies gender. The R command rep will just create a vector of repeated numbers. We make a set of 60 values equal to .5 for males, and 60 values equal to -.5 for females. The reason for picking these specific values is that it helps interpretation of regression output if we set the average for the two groups to zero and make the mean difference between them equal to one. However, it's not
essential to do this, and you could have picked other numbers, such as 0 and 1, to indicate group identity.

males=rep(.5,myn) # Create vector with myn repetitions of value .5
females=rep(-.5,myn) # Create vector with myn repetitions of value -.5

Having made our two sets of numbers, we then join them together in a variable called gender as follows:

gender=c(males, females)

Run these commands and then type gender at the console to check the result. All that is now needed is to bolt this column on to our existing myarrayAB, which we can do with a single command for concatenating columns, cbind:

myarrayAB=cbind(gender,myarrayAB)

Note that I have created a lot of intermediate variables in the course of generating myarrayAB. This is unnecessary and uses up memory. It would be possible to combine several steps in one command and so avoid creating the intermediate variables. However, when learning R, I think it is helpful to break commands down into small steps and create new variables, as this allows you to see the logic of what is being done, and to check the values of each variable. It also makes your scripts easier to understand when you come back to them later. Very experienced programmers may write much more compact code than this, but with modern computers, memory is seldom a problem unless you are working with very large data arrays, and so, apart from demonstrating how clever you are, compact code doesn't serve much function.

We now want to do a regression analysis. We will start with simple regression of V on U for the combined group data. R has many powerful commands for doing regression, but it requires that the data are formatted in what is called a data frame. Fortunately, this transformation is trivially easy: we just add the command:

mydata=data.frame(myarrayAB)

Commands for regression in R are formulated in terms of the general linear model.
This is a very general and flexible approach to statistical analysis that readily incorporates the more traditional methods beloved of psychologists, such as analysis of variance. However, I suspect that many psychologists reading this won't find it a very intuitive way to think about data, and it takes a while to map the R commands onto pre-existing statistical knowledge.

The other thing that can be puzzling is that with programs such as SPSS, we are used to running a command and then looking at the output screen. Although R can be used in an analogous way, it is more usual to write the results to another variable. The variable that holds the results is likely to be a fairly complex structure, as we shall see. But the basic idea is that you don't just use a command to do the analysis: you actually specify a name for the output of the analysis.

The simplest form of regression is pretty easy. The command lm just stands for linear model, and requires two obligatory arguments: you have to specify a formula that indicates the relationship between predicted and predictor variables, and specify the dataset used to estimate the regression coefficients. So let's illustrate this with our U and V variables. Add this command to the script:

myreg1=lm(V~U,mydata)

and then inspect the myreg1 variable that is created. This contains two coefficients: an intercept, which is close to zero, and a slope, which is close to 0.5. Note that when you type myreg1 you also get information about the formula used to generate the coefficients, labelled 'call'. The output of lm contains a complex set of varied information in a
structure. If you want to look at just part of the structure, you have to use the $ sign to indicate which bit. Try this, by just typing at the console:

myreg1$call

and

myreg1$coef

You will see that the portion after the $ indicates which bit of the myreg1 structure is referred to. The term V~U tells the program to fit a straight line according to the formula:

V = b1 + b2.U

where b1 is the intercept and b2 is the slope. It is these intercepts and slopes that are generated when the lm command is executed. We can use these outputs to plot the regression line. First plot the raw data, with U on the horizontal axis and V on the vertical axis. This command will achieve that:

plot(V~U, mydata)

The command abline plots a straight line with a given intercept and slope. You could add a straight line through an intercept of zero and with a slope of 1, as follows:

abline(0,1)

The regression line is simply the straight line with intercept and slope corresponding to the computed regression coefficients, and so can be plotted just by typing:

abline(myreg1$coef)

The lty argument allows you to specify the type of line you want. This command will redraw the regression line as a dashed line:

abline(myreg1$coef,lty=5)

As an aside here, I haven't used R very much, and when I first saw a command with lty I was confused and thought it was some kind of variable. This is, in my experience, a common difficulty with R. Various letter sequences that look like variables or functions aren't. What did I do? I Googled "R lty" and immediately all became clear. Perhaps the single most important piece of advice if you want to learn R is to just use Google if you get stuck.

We now want to look at the regression with gender included. A simple modification to the syntax achieves this. We have taken care to code gender so that the sum of the two gender codes is zero, and we can include it in the linear model, even though it is a categorical variable.
Here is the command:

myreg2=lm(V~U+gender,mydata)

This corresponds to the regression equation:

V = b1 + b2.U + b3.gender

If we type myreg2, we see that the output now has one intercept and two regression coefficients, like this:

(Intercept)        U   gender
    0.03357  0.04529  1.68913

Your values may differ from this because the simulated data will be different, but the overall pattern will be similar. Note that the regression coefficient associated with U is now close to zero, whereas that associated with gender is much bigger. Once we have run the model we can get much more detailed statistical output by requesting a summary, as follows:

summary(myreg2)
Now we have not only the coefficients, but their standard errors, associated t-values and significance levels. This confirms that gender is a substantial predictor of V, and U is not. Finally, you can use the anova command to produce an ANOVA table comparing the fit of the two models:

anova(myreg1,myreg2)

I've learned a lot about using R for regression analysis from this site. It also has information on how to do diagnostic plots, for instance. However, for the present, I won't get diverted into that, but will rather press on to look at what happens if you have groups defined on a variable that is highly correlated with one of the dependent variables.

6. Demonstrating how removing the effect of group will be misleading if group identity is highly dependent on one of the variables

You should by now be able to follow this script, which is heavily commented to explain each step. This time we are going to generate a multivariate normal distribution with three variables. Two of them, L1 and L2, are language measures, and A is an auditory measure. The language measures show moderate correlation with the auditory measure and are highly intercorrelated with one another. Group identity (control or language impaired, coded 1 or -1) is defined in terms of whether or not the score on L1 is above a z-score of -1. This, then, is analogous to the case of dyslexia or language impairment, where we define whether or not the child has the diagnosis on the basis of a low test score. In a case like this, removing the effect of group can abolish the relationship between L2 and A, simply because L1 and L2 are highly intercorrelated. It would be quite wrong to conclude from this that L2 and A are not related.
#demo_spurious_corr_script3
# Using a group variable that is highly correlated with one variable
# With these settings, gives the result that by including SLI category
# you remove the influence of L2
require(MASS) # Load functions from Modern Applied Statistics with S
mylabels=c('L1','L2','A') # 3 variables, two language and one auditory
myr=.8 # Correlation between the language measures
myr2=.3 # Correlation of both language measures with auditory
mysigma = matrix(c(1,myr,myr2, myr,1,myr2, myr2,myr2,1),3,3)
myn=60
set.seed(6) # Change or comment out this line to get a different set of estimates
mymean=c(0,0,0) # Means for L1, L2, and A are zero
myarray3=mvrnorm(n=myn, mymean, mysigma, empirical=TRUE)
colnames(myarray3)=mylabels
summary(myarray3)
cor(myarray3)
myL1=myarray3[,1] # First column
# Now determine which cases are control or SLI and put in mygroup variable
mygroup=rep.int(-1,myn) # Default is SLI, coded -1
mycon=which(myL1 > -1) # Row index of those with L1 in control range
mygroup[mycon]=1 # These rows are assigned group code of 1 (control)
myarrayAB=cbind(mygroup,myarray3) # Add mygroup to the data array
mydata=data.frame(myarrayAB)
# Regression with only group included
myreg1=lm(A~mygroup,mydata)
summary(myreg1)
# Regression with both group and L2 included
myreg2=lm(A~L2+mygroup,mydata)
summary(myreg2)
anova(myreg1,myreg2)
# Regression if we exclude group ID
myreg3=lm(A~L2,mydata)
summary(myreg3)

The point I want to make with this simulation is that if we want to 'take out' the effect of group identity from a correlation, then we need to think carefully about the logic of what we are doing. In the previous example of spurious correlation, we defined gender quite independently of our two measures, height and hairiness. Although males and females differed substantially on both measures, their gender was not determined by those measures. In any logical causal route, we can confidently treat gender as a primary cause, and so it makes sense to 'take out' its effect.

For certain developmental disorders (and indeed other conditions), the causal route is much less certain, because the disorder is diagnosed on the basis of measured variables. So, for instance, dyslexia is defined in terms of low scores on reading measures. In the simulation above, we looked at the correlation between L2 and A, and defined our disorder in terms of L1 – which was highly correlated with L2. We could have defined dyslexia in terms of L2 – you might like to try that: it will achieve a similar effect.

The results we got from our simulation are actually sensible, but there is a danger they will be misinterpreted. What they are actually telling us is that language measures and auditory measures are significantly correlated, and this is evident regardless of whether we use a categorical language measure, where group identity is determined by a cutoff on a test, or a quantitative measure. What this analysis is definitely not saying is that the correlation between language and auditory measures is spurious. It's possible to imagine a situation where you could have a spurious association with these kinds of variables.
For instance, poor social environment may affect both language measures and auditory measures. To show that, we'd need to incorporate a measure of social environment in our regression analysis. But the bottom line is that if we want to argue that an association between variables X and Y is spurious, we must have a third variable, Z, that is (a) measurable and (b) not dependent on X or Y. Z may be highly correlated with X and Y: that's not a problem. The problem is when Z is determined by X or Y.
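This point about a genuine third variable Z can itself be simulated. In the sketch below (my own illustration, with made-up coefficients), Z stands for something like social environment and independently drives both X and Y; including Z in the model then correctly shrinks the X coefficient towards zero:

```r
set.seed(5)
myn = 200
Z = rnorm(myn)            # a genuine third variable, e.g. social environment
X = .7*Z + rnorm(myn)     # X depends on Z plus independent noise
Y = .7*Z + rnorm(myn)     # Y depends on Z plus independent noise
mydata = data.frame(X, Y, Z)
cor(X, Y)                      # X and Y look correlated...
summary(lm(Y ~ X, mydata))     # ...and X 'predicts' Y on its own
summary(lm(Y ~ X + Z, mydata)) # but with Z included, the X coefficient collapses
```

Here it is legitimate to 'take out' Z, because Z was generated independently of X and Y; contrast this with the L1/L2 example above, where group identity was itself defined by one of the measures.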