Very basic introduction to simulating data to illustrate issues affecting reproducibility. Uses Excel and R, but assumes no prior knowledge of R. Please let me know of errors or things that need better explanation.
1. Simulating data to gain insights into
power and p-hacking
Dorothy V. M. Bishop
Professor of Developmental Neuropsychology
University of Oxford
@deevybee
2. Before you get started….
• The early exercises in this lesson use Microsoft Excel,
which most people will have installed
• The later exercises use R and R studio. This is free
software. If you don’t have it, you’ll need to download
it. As this can take time, it’s recommended that you do
that before you go further.
• Please follow instructions on the next slide.
3. Installing R
• Open an internet browser and go to www.r-project.org.
• Click the "download R" link in the middle of the page under
"Getting Started."
• Click on the link for a CRAN location close to you
• Mac users:
• Click on the "Download R for (Mac) OS X" link at the top of the page.
• Click on the file containing the latest version of R under "Files."
• Save the .pkg file, double-click it to open, and follow the installation
instructions.
• Windows users:
• Click on the "Download R for Windows" link at the top of the page.
• Click on the "install R for the first time" link at the top of the page.
• Click "Download R for Windows" and save the executable file
somewhere on your computer.
• Run the .exe file and follow the installation instructions.
4. Installing RStudio
• RStudio is a friendly interface for R. Once it is installed, you
need not open the original R software: instead, you access
R by opening the RStudio application
• Go to www.rstudio.com and click on the "Download RStudio"
button.
• Click on "Download RStudio Desktop."
Mac users:
• Click on the version recommended for your system, or the latest
Mac version, save the .dmg file on your computer, double-click it
to open, and then drag and drop it to your applications folder.
Windows users:
• Click on the version recommended for your system, or the latest
Windows version, and save the executable file. Run the .exe file
and follow the installation instructions.
5. Why invent data?
• If you can anticipate what your data will look like, you
will also anticipate a lot of issues about study design
that you might not have thought of
• Analysing a simulated dataset can clarify what the
optimal analysis is and how the analysis works
• Simulating data with an anticipated effect is very
useful for power analysis – deciding what sample size
to use
• Simulating data with no effect (i.e. random noise) gives
unique insights into p-hacking
6. Ways to simulate data
• For newbies: to get the general idea: Excel
• Far better but involves steeper learning curve: R
• Also (but not covered here) options in SPSS and
Matlab:
• e.g. https://www.youtube.com/watch?v=XBmvYORP5EU
• http://uk.mathworks.com/help/matlab/random-number-
generation.html
7. Basic idea
• Anything you measure can be seen as a
combination of an effect of interest plus random
noise
• The goal of research is to find out
• (a) whether there is an effect of interest
• (b) if yes, how big it is
• Classic hypothesis-testing with p-values focuses just
on (a) – i.e. have we just got noise or a real effect?
• We can simulate most scenarios by generating
random noise, with or without a consistent added
effect
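This "effect plus noise" idea can be written directly in R (a minimal sketch with illustrative names, not one of the lesson's scripts):

```r
# Each measurement = effect of interest + random noise
set.seed(1)                            # fix the random numbers so the run is repeatable
true_effect <- 0.5                     # the consistent added effect
noise <- rnorm(10, mean = 0, sd = 1)   # pure random noise, 10 observations
scores <- true_effect + noise          # what we actually "measure"
mean(scores)                           # near 0.5, but not exactly: sampling error
```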
8. Basic idea: generate a set of random numbers in Excel
• Open a new workbook
• In cell A1 type random number
• In cell A2 type = rand()
Grab the little square in the bottom right of A2 and pull it
down to autofill the cells below to A8
9. Random numbers in Excel, ctd
• You have just simulated
some data!
• Are your numbers the
same as mine?
• What happens when
you type rand() in
A9?
10. Random numbers in Excel, ctd.
• Your numbers will be different to mine – that’s because they
are random.
• The numbers will change whenever you open the worksheet,
or make any change to it.
• Sometimes that’s fine, but for this demo we want to keep
the same numbers. To control when random numbers
update, select Manual in Formula|Calculation Options.
• To update to new numbers use Calculate Now button.
11. Random numbers in Excel, ctd.
• The rand() function generates random numbers between 0 and 1:
Are these the kind of numbers
we want?
12. Realistic data usually involves normally distributed numbers
• Nifty way to do this in Excel: treat generated numbers as p-values
• The normsinv() function turns a p-value into a z-score
13. Normally distributed random numbers
Try this:
• Type = normsinv(A2) in
cell B2
• Drag formula down to
cell B8
• Now look at how the
numbers in column A
relate to those in
column B.
NB. In practice, we can generate normally distributed random numbers
(i.e. z-scores) in just one step with formula: = normsinv(rand())
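The same trick works in R, where qnorm() plays the role of normsinv() and runif() plays the role of rand(). This is a sketch for comparison, not part of the lesson's scripts:

```r
# Excel's normsinv(rand()) in R: inverse normal CDF applied to uniform numbers
set.seed(42)         # fix the random stream so the run is repeatable
p <- runif(5)        # like rand(): uniform numbers between 0 and 1
z <- qnorm(p)        # like normsinv(): turn each p-value into a z-score
# rnorm(5) does both steps at once and is the usual R idiom
```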
14. Now we are ready to simulate a study where we have
2 groups to be compared on a t-test
• Pull down the
formula from
columns A
and B to
extend to
A11:B11
• Type a header
‘group’ in C1
• Type 1 in
C2:C6 and 2
in C7:C11
15. What is the formula for the t-test in Excel?
Basic rule for life, especially in programming: if you don’t know it,
Google it
TTEST formula in xls:
You specify:
Range 1
Range 2
tails (1 or 2)
type
1 = paired
2 = unpaired equal variance
3 = unpaired unequal variance
16. Try entering the formula for the t-test in C12
=TTEST(B2:B6,
B7:B11,2,2)
What is the number
that you get?
This formula gives
you a p-value
Now press
‘calculate now’ 20
times, and keep a
tally of how many
p-values are < .05 in
20 simulations
17. • What has this shown you?
• P-values ‘dance about’ even when data are entirely
random
• On average, one in 20 runs will give p < .05 when null
hypothesis is true – no difference between groups
See Geoff Cumming: Dance of the p-values
https://www.youtube.com/watch?v=5OL1RqHrZQ8
Congratulations! You have done your first simulation
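Once you move to R, the 20-press tally can be automated over many more runs. A minimal sketch (not one of the lesson's scripts):

```r
# Automating the 'press Calculate Now 20 times' tally, 1000 times over:
# both groups come from the same population, so any p < .05 is a false positive
set.seed(2025)                                   # arbitrary seed for repeatability
pvals <- replicate(1000, t.test(rnorm(5), rnorm(5))$p.value)
mean(pvals < .05)                                # long-run false positive rate, near .05
```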
18. We’ll stick with Excel for one more simulation
• So far, we’ve simulated the null hypothesis - random
data. If we find a ‘significant’ difference, we know it’s a
false positive
• Next, we’ll simulate data with a genuine effect.
• It’s easy to do this: we just add a constant to all the
values for group 2
• Since we’re using z-scores, the constant will correspond
to the effect size (expressed as Cohen’s d).
• Let’s try an effect size of .5
• For cells B7, change the formula to = normsinv(A7)+.5
• Drag the formula down to cell B11 and hit ‘Calculate
now’
19. I’ve added formulae to
show the mean and SD for
the two groups:
= AVERAGE(B2:B6)
= STDEV(B2:B6)
= AVERAGE(B7:B11)
= STDEV(B7:B11)
Your values will differ.
Why isn’t the difference in
means for the two groups
exactly .5?
20. I’ve added formulae to
show the mean and SD for
the two groups:
= AVERAGE(B2:B6)
= STDEV(B2:B6)
= AVERAGE(B7:B11)
= STDEV(B7:B11)
Your values will differ.
Why isn’t the difference in
means for the two groups
exactly .5?
ANSWER: mean/SD
describe the population;
this is just a sample from
that population
21. Now add the formula
for the t-test
Is p < .05 ?
It’s pretty unlikely
you will see a
significant result.
Why?
22. Now add the formula
for the t-test
Is p < .05 ?
It’s pretty unlikely
you will see a
significant result.
Why?
ANSWER: Sample too
small – can’t pick out
signal from noise
23. What have we learned so far?
• The first simulation gave some insights into false positive
rates: it shows how you can get a ‘significant’ result from
random data
• The second simulation illustrates the opposite situation:
showing how often you can fail to get a significant p-value,
even when there is a true effect (a false negative)
• This brings us on to the topic of statistical power: the
probability of detecting a real effect with a given sample size
• To build on these insights we need to do lots of simulations,
and for that it’s best to move to R (which hopefully you have
already installed; if not, see the installation slides near the start)
24. Benefits of simulating data in R
• Can write a script that executes commands to generate data
and then run it automatically many times and store results
• Much faster than Excel, and reproducible
• Can generate different distributions, correlated variables, etc.
• Powerful plotting functions
• A good way of starting to learn R
Downside: Steep initial learning curve
But remember: Google is your friend
Tons of material about R on the internet
Ready? Create a folder to save your work and fire up R studio!
25. Self-teaching scripts on https://osf.io/skz3j/
Download, save and open this one: Simulation_ex1_intro.R
Source pane: script Console: try commands out here Environment:
check variables here
26. First thing to do: Set working directory
• Working directory is where R will default to when reading and
writing stuff
• Easiest way to set it: Go to Session|Set working directory
Note that when you do this, the command to set working directory will pop up on the
console. On my computer I see:
setwd("~/deevybee_repo")
27. Now we’ll go through the script: it will generate the same
type of 2-group data as in the second Excel exercise
Preliminaries: Install packages. Use Tools|Install Packages
• Remember! A common reason for R code not to work is because you have not
installed a package that you need.
• After installing the package you have to use the library or require
command in your script to load it for this session.
28. To run the code in lines 41-49…
• Select the lines of code
• Click on the Run button in the top bar
• Check what happens in the console
Running a script line by line is a good way to learn R
29. Now start simulating data!
• rnorm is an inbuilt R function that generates
random normal deviates
Now run lines 56-68
30. Now start simulating data!
• rnorm is an inbuilt R function that generates
random normal deviates
• Note that as well as results you specify being
shown on the console, any variables you create are
now featured in the environment pane
Now run lines 56-68
31. Think about questions on lines 72-74
• If you’re confused, remember what you’ve been
taught in basic statistics (I hope!) about the
differences between a population and a sample.
• The mean/SD we specify determines
characteristics of the population from which we
are sampling.
See also:
http://deevybee.blogspot.com/2017/12/using-
simulations-to-understand.html
32. Now we’ll run lines 79-91 to generate data for
another group with different mean
• If our scores are z-scores and the mean for group 1 is zero, then myM2
corresponds to Cohen’s d measure of effect size.
• The final command creates interesting output on the console: results of a
Welch 2-sample t-test (i.e. t-test with correction for unequal variances)
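A minimal sketch of what this section of the script is doing (variable names follow the slides' conventions, but this is not the script itself):

```r
# Two-group data with a true effect: group 2 mean = myM2 = Cohen's d (since SD = 1)
set.seed(7)                         # arbitrary seed so the run is repeatable
myN  <- 5
myM2 <- 0.5
groupA <- rnorm(myN, mean = 0,    sd = 1)
groupB <- rnorm(myN, mean = myM2, sd = 1)
t.test(groupA, groupB)              # Welch two-sample t-test, as in the script
```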
33. Advantages of R over Excel
• Can easily regenerate the data from the
script
• Very easy to change one parameter and
generate a new dataset
• We will see shortly how to repeatedly run a
simulation and store results by using a loop
• But first we’ll do some data reformatting
and show a neat way of plotting the results
34. Making a data frame
• A data frame is a way of storing data, rather like an Excel worksheet
• You can store observations in rows and variables in columns
• Data frames are versatile and can hold different variable types
• We’ll put our newly created vectors into a data frame, mydf, with columns for
group and score
• We can easily view mydf by clicking on mydf in the Environment tab
35. Filling the data frame
You can refer to a specific cell in a data frame with the row and column index
e.g. mydf[3, 2] refers to 3rd row and 2nd column. Note square brackets here
You can refer to a whole column by using $ and its name, e.g. mydf$Group
You can also refer to a specific row of a named column, e.g. mydf$Group[3]
Run lines 117-125
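The three ways of indexing can be tried on a small hypothetical data frame laid out like the script's mydf:

```r
# A hypothetical data frame mirroring the script's layout:
# observations in rows, variables Group and Score in columns
mydf <- data.frame(Group = rep(1:2, each = 5),
                   Score = rnorm(10))
mydf[3, 2]       # square brackets: 3rd row, 2nd column (a Score value)
mydf$Group       # $ plus name: the whole Group column
mydf$Group[3]    # 3rd row of the named column
```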
36. Deconstructing the t-test result
• One reason for making a data frame is that there are many functions in R that
operate on data frames.
• One of these is the pirateplot function from the yarrr package. This creates a
nice kind of plot called a pirate plot, which shows the distribution of individual
data points as well as other summary statistics. We want to make a pirate plot
with a header that shows the t-test result
• Run line 131: myt <- t.test(myvectorA,myvectorB)
The comments explain this more, but basically you can extract bits of the output
in myt using $. If you type in the console:
myt$
A menu pops up showing you which parameters there are.
Now run lines 145-149, which show how you can bolt together bits of output from
the t-test to make a useful header for a plot
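A sketch of the extract-and-paste approach (the exact header format in the script may differ):

```r
# Extract pieces of a t-test result with $ and paste them into a plot header
set.seed(3)                                  # arbitrary seed
myt <- t.test(rnorm(10), rnorm(10))          # myt is a list of named components
myheader <- paste0("t(", round(myt$parameter, 1), ") = ",
                   round(myt$statistic, 2),
                   ", p = ", round(myt$p.value, 3))
myheader
```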
37. Make a pirate plot
• Run line 151:
• pirateplot(Score~Group,data=mydf,main
=myheader, xlab="Group",
ylab="Score")
Your plot will be different from this because we are
generating random numbers that vary on each run.
The pirate plot is not a well-known type of graphic; this is a
perfect opportunity to practise Googling to learn more about
it – you should try varying the script to see how you can
affect the graph
38. Some general points to help you learn R
1. Basic rule for life, especially in programming: if you don’t know
it, Google it
In R, Google your error message
2. Best way to learn is by making mistakes
If you see a line of code you don’t understand, play with it to find
out what it does.
Look at Environment tab, or type name of variable on the console
to check its value
E.g., you want repeating numbers? Type in the console to
compare: rep(1, 3) and rep(3, 1)
39. Pause to play with the script.
Make a note of any questions
40. Simulation_ex1_multioutput.R
This is essentially the same as the previous script, except that:
• The plots are sent to a pdf rather than being output on the Plots pane (see
comments in the script for explanation)
• You run the simulation repeatedly, with two different values for N
The structure of the script is with 2 nested loops:
for (i in 1:2){ #line 15
……… #various commands here
for (j in 1:10){ #line 21
……… #various commands here
}
}
• The first loop runs twice; the second loop, which is nested inside it, runs 10 times.
So overall there are 20 runs
• The value,i,in the first loop, controls sample size which is either myNs[1] or
myNs[2]
• The value, j, in the second loop just acts as a counter, to ensure that there are 10
repetitions
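The nested-loop structure can be compressed into a runnable sketch (illustrative names; the real script also writes plots to pdf):

```r
# The two nested loops in miniature: the outer loop picks the sample size,
# the inner loop repeats the simulation 10 times and stores each p-value
set.seed(4)                              # arbitrary seed for repeatability
myNs <- c(20, 100)
pmat <- matrix(NA, nrow = 2, ncol = 10)
for (i in 1:2) {                         # i indexes the sample size
  for (j in 1:10) {                      # j just counts the 10 repetitions
    a <- rnorm(myNs[i], mean = 0)
    b <- rnorm(myNs[i], mean = 0.3)      # true effect size d = .3
    pmat[i, j] <- t.test(a, b)$p.value
  }
}
rowMeans(pmat < .05)                     # proportion 'significant' at each N
```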
41. Run the whole script!
Click on the Files tab in the bottom right-hand pane, and
you’ll see you have created two new pdf files (you may
need to scroll down to see them):
Look at these files, paying particular attention to the proportion
of runs where p < .05.
42. 10 runs of simulation with N = 20 per group and effect size (d) = .3
43. 10 runs of simulation with N = 100 per group and effect size (d) = .3
44. Points to note
• Smaller samples associated with more variable results.
• With small sample sizes, true but weak effects will usually
not give you a ‘significant’ result (i.e. p < .05).
• In the example here, with effect size of .3, sample of 100
per group only gives a significant result on around 60% of
runs.
• This is the same as saying the power of the study to
detect an effect size of .3 is .60, i.e. 60%
• Many statisticians recommend power should be 80% or
more (though this will depend on the purpose of the study).
45. Body of table shows sample size per group
Jacob Cohen worked this all out in 1988
46. Estimating statistical power for your study
For simple designs can use G-power package (or Cohen’s
formulae)
For more complex designs, simulation is a better approach –
just run the analysis on simulated data 10,000 times and
then see how frequently your result is ‘significant’ by
whatever criterion you plan to use.
This requires you to have a sense of what your data will look
like, and you have to have an estimate of what is the
smallest effect size that you’d be interested in.
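The simulation approach described above can be wrapped in a small R function (a sketch; power_sim is an illustrative name, and 2,000 runs are used here for speed where 10,000 would give a more stable estimate):

```r
# Power by simulation: proportion of runs with p < .05 for a given n and d
set.seed(5)                              # arbitrary seed for repeatability
power_sim <- function(n, d, nruns = 2000) {
  p <- replicate(nruns, t.test(rnorm(n), rnorm(n, mean = d))$p.value)
  mean(p < .05)                          # estimated power
}
power_sim(100, 0.3)                      # roughly .6, as on the earlier slide
```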
47. “Small studies continue to be carried out with little more
than a blind hope of showing the desired effect. Nevertheless,
papers based on such work are submitted for publication,
especially if the results turn out to be statistically significant.”
– Newcombe, 1987
Weak statistical power has been, and continues to be, a
major cause of problems with replication of findings
49. P-hacking and Type I error (false positives)
Load simulation_ex2_correlations.R
Often studies have multiple variables of interest.
This script shows you how to use the mvrnorm function
from the MASS package to simulate multivariate normal data
It also demonstrates the dangers of p-hacking
First just ensure the necessary packages are installed and
load them using library(): run lines 11-14
50. Introduction to mvrnorm
In R, if you want to know how to use a function, you can just type help,
e.g. help(mvrnorm)
But often the official help information is technical and unfriendly and you may find
more useful and accessible information and examples by Googling.
The essential arguments for mvrnorm are the sample size (n), mu, which is a vector
of means (one per variable), and Sigma, a matrix showing the correlations between
variables. We’ll ignore the other arguments provided on the Help page for this demo.
To make life easy, we will again create z-scores for our data, so mean will be zero and
SD = 1.
We can set nvar to 7 and then specify mu = rep(0, nvar).
You could just have mu = rep(0, 7)
or even mu = c(0,0,0,0,0,0,0)
But a good script avoids hardcoding variables like this: you want to be able to try
running the script with a range of values, and it’s much easier just changing the initial
definition of nvar than retyping all the lines of code that use nvar.
51. Specifying covariance between variables
You should be familiar with the correlation coefficient, r
If we are using z-scores and have r = .5, what is the covariance?
52. Creating the covariance matrix
• One benefit of using z-scores is that the covariance matrix is the same as the
correlation matrix, so if we specify the amount of correlation between variables,
then we can easily make the covariance matrix that we need.
• For simplicity, we’ll just assume that all our simulated variables are
intercorrelated by the same amount, a value we’ll call myCorr.
• So, if we have 7 variables, and myCorr is 0, we will need a matrix like this:
• The script achieves this just by making a matrix where all values = myCorr, and
then overwriting the diagonal with myVar (which we’ve set to 1)
• N.B. The script is set to simulate uncorrelated variables, so off-diagonal values are
0, but you could experiment with other values, by changing myCorr
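The matrix construction described above, as a runnable sketch (variable names follow the slides):

```r
# Covariance matrix: all off-diagonal cells = myCorr, diagonal = myVar
nVar   <- 7
myCorr <- 0                                         # uncorrelated variables
myVar  <- 1                                         # variance of z-scores
myCov  <- matrix(myCorr, nrow = nVar, ncol = nVar)  # fill everything with myCorr
diag(myCov) <- myVar                                # overwrite the diagonal
myCov[1:3, 1:3]   # top-left corner: 1s on the diagonal, 0s elsewhere
```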
53. Running mvrnorm
Before starting, it’s a good idea to clear all variables: R does not do that
automatically, and it can be problematic if you still have values of variables from an
earlier session. To clear them all, click the little broom symbol on the Environment
tab.
Now run all the lines of the script up to and including line 51.
Check the Environment tab, which will show all the variables you have created.
Skip over the command on line 60 for the moment.
That line starts a loop and if you try to run it, the system will hang waiting for a close
curly bracket to match the open curly bracket. (You can get out of that by either
hitting escape or typing a close curly bracket on the console).
For now, just run line 69
mydata<-mvrnorm(n=myN, mu=rep(myM,nVar), Sigma=myCov)
54. As with the Excel simulation, the script generates a fresh set of
numbers on each run, though we can modify the settings to override
this. (Google ‘setting seed’ in R)
First six rows of mydata look like this:
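Setting a seed makes the mvrnorm output reproducible; a sketch using the script's variable names (the seed value itself is arbitrary):

```r
# Reproducible mvrnorm: set.seed fixes the 'fresh numbers' between runs
library(MASS)                     # provides mvrnorm
set.seed(123)                     # any fixed value will do
myN   <- 30
nVar  <- 7
myCov <- diag(nVar)               # uncorrelated z-score variables
mydata <- mvrnorm(n = myN, mu = rep(0, nVar), Sigma = myCov)
head(mydata)                      # the first six rows
```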
55. Now we can analyse the simulated data!
Let’s look at correlations between the seven variables
Pick your favourite variables by selecting two numbers
between 1 and 7
Thought experiment: We’ve simulated uncorrelated variables.
In a single run, how likely is it that we’ll see:
• No significant correlations
• Some significant correlations
• A significant correlation (p < .05) between your favourite
variables
56. Correlation matrix for run 1
Output from simulation of 7 independent variables, where true correlation = 0
N = 30
Red denotes p < .05 (r > .31 or r < -.31)
57. Correlation matrix for run 2
Output from simulation of 7 independent variables, where true correlation = 0
N = 30
Red denotes p < .05 (r > .31 or r < -.31)
58. Correlation matrix for run 3
Output from simulation of 7 independent variables, where true correlation = 0
N = 30
Red denotes p < .05 (r > .31 or r < -.31)
There is no relation between variables – why do we have
significant values?
59. Correlation matrix for run 4
Output from simulation of 7 independent variables, where true correlation = 0
N = 30
Red denotes p < .05 (r > .31 or r < -.31)
On any one run, we are looking at 21 correlations.
So we should use Bonferroni corrected p-value: .05/21 = .002,
corresponds to r = .51
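The arithmetic behind the correction, as a quick R check:

```r
# 7 variables give choose(7, 2) = 21 pairwise correlations,
# so the Bonferroni-corrected threshold is .05 / 21
nVar <- 7
ncorrs <- choose(nVar, 2)
alpha_corrected <- .05 / ncorrs
c(ncorrs = ncorrs, alpha = round(alpha_corrected, 4))
```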
60. Now try to work through the script yourself
• You can run the script to generate your own table of results (it is
set up just to show the table for the final run).
• The bit of the script for generating tables showing significant p-
values in colour is complex: don’t worry if you don’t understand
it.
• Most important thing is that you should develop competence to
play around with the script and see how the output changes
depending on how you change the sample size, the number of
variables, and the true correlation between variables.
61. • Use of the .05 cutoff makes sense only in relation to an a priori
hypothesis
Many ways in which ‘hidden multiplicity’ of testing can give false
positive (p < .05) results
• Data dredging from a large set of variables
• Multi-way Anova with many main effects/interactions
• Cramer, A. O. J., et al (2016). Hidden multiplicity in exploratory multiway ANOVA:
Prevalence and remedies. Psychonomic Bulletin & Review, 23(2), 640-647.
doi:10.3758/s13423-015-0913-5)
• Trying various analytic approaches until one ‘works’
• Post-hoc division of data into subgroups
In the latter two instances, it may be hard to estimate the
appropriate correction – many binary choices -> multiplicative effects
Key point: p-values can only be interpreted in terms of the context
in which they are computed
62. 1 contrast
Probability of a ‘significant’ p-value < .05 = .05
Large population database used to explore link between ADHD and handedness
https://figshare.com/articles/The_Garden_of_Forking_Paths/2100379
Demonstration of rapid expansion of comparisons with binary divisions
63. Focus just on Young subgroup: 2 contrasts at this level
Probability of a ‘significant’ p-value < .05 = .10
64. Focus just on Young on measure of hand skill: 4 contrasts at this level
Probability of a ‘significant’ p-value < .05 = .19
65. Focus just on Young, Females on measure of hand skill: 8 contrasts at this level
Probability of a ‘significant’ p-value < .05 = .34
66. Focus just on Young, Urban, Females on measure of hand skill: 16 contrasts at this level
Probability of a ‘significant’ p-value < .05 = .56
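The probabilities on these slides follow from assuming independent contrasts: the chance of at least one 'significant' result among k tests is 1 - (1 - .05)^k. A quick check in R:

```r
# Family-wise false positive rate for k independent contrasts at alpha = .05
k <- c(1, 2, 4, 8, 16)
fwe <- 1 - (1 - .05)^k
round(fwe, 2)    # 0.05 0.10 0.19 0.34 0.56 - the values on these slides
```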
67. Further reading
1956: De Groot – failure to distinguish between hypothesis-testing
and hypothesis-generating (exploratory) research
-> misuse of statistical tests
de Groot, A. D. (2014). The meaning of “significance” for
different types of research [translated and annotated by Eric-
Jan Wagenmakers, et al]. Acta Psychologica, 148, 188-194.
doi:10.1016/j.actpsy.2014.02.001
68. R scripts available at: https://osf.io/view/reproducibility2017/
• Simulation_ex1_intro.R
Suitable for R newbies. Demonstrates ‘dance of the p-values’ in a t-test.
Bonus, you learn to make pirate plots
• Simulation_ex2_correlations.R
Generate correlation matrices from multivariate normal distribution.
Bonus, you learn to use ‘grid’ to make nicely formatted tabular outputs.
• Simulation_ex3_multiwayAnova.R
Simulate data for a 3-way mixed ANOVA. Demonstrates need to correct
for N factors and interactions when doing exploratory multiway Anova.
• Simulation_ex4_multipleReg.R
Simulate data for multiple regression.
• Simulation_ex5_falsediscovery.R
Simulate data for mixture of null and true effects, to demonstrate that
the probability of the data given the hypothesis is different from the
probability of the hypothesis given the data.
Two simulations from Daniel Lakens’ Coursera Course – with notes!
• 1.1 WhichPvaluesCanYouExpect.R
• 3.2 OptionalStoppingSim.R
Now even more: see OSF!