Example sweavefunnelplot
1 Example of self-documenting data journalism notes
This is an example of using Sweave to combine code and output from the R statistical programming
environment with the LaTeX document processing environment to generate a self-documenting
script, in which the actual code used to do the stats and generate the statistical graphics is displayed
alongside the charts it directly produces.
1.1 Getting Started...
The aim is to try to replicate a graphic included by Ben Goldacre in his article DIY statistical
analysis: experience the thrill of touching real data (footnote 1).
> # The << echo = T >>= identifies an R code region;
> # echo=T means run the code, and print what happens when it's run
> # In the code area, lines beginning with a # are comment lines and are not executed
>
> #First, we need to load in the XML library that contains the scraper function
> library(XML)
> #Now we scrape the table
> srcURL='http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis'
> cancerdata=data.frame(
+ readHTMLTable( srcURL, which=1, header=c('Area','Rate','Population','Number') ) )
>
> #The @ symbol on its own at the start of a line marks the end of a code block
The format is simple: readHTMLTable(url,which=TABLENUMBER) (TABLENUMBER is used to
extract the N’th table in the page.) The header part labels the columns (the data pulled in from
the HTML table itself contains all sorts of clutter).
We can inspect the data we’ve imported as follows:
> #Look at the whole table (the whole table is quite long,
> # so don't display it; the command is commented out for now)
> #cancerdata
> #If you are using RStudio, you can inspect the data using the command: View(cancerdata)
> #Look at the column headers
> names(cancerdata)
[1] "Area" "Rate" "Population" "Number"
> #Look at the first few rows (head shows 6 by default)
> head(cancerdata)
              Area  Rate Population Number
1 Shetland Islands 19.15      31332      6
2         Limavady 21.49      32573      7
3       Ballymoney 17.05      35191      6
4   Orkney Islands 29.87      36826     11
5            Larne 27.54      39942     11
6      Magherafelt 15.26      45872      7
> #Look at the last few rows (tail shows 6 by default)
> tail(cancerdata)
          Area  Rate Population Number
374  Wiltshire 18.69     727662    136
375  Sheffield  16.9     757396    128
376     Durham 17.29     786582    136
377      Leeds  17.3     959538    166
378   Cornwall 15.44    1062176    164
379 Birmingham 19.78    1268959    251
> #What sort of datatype is in the Number column?
> class(cancerdata$Number)
[1] "factor"
1 http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis
The last command, class(cancerdata$Number), identifies the data as type factor. In order to
do stats and plot graphs, we need the Number, Rate and Population columns to contain actual
numbers. (Factors organise data according to categories; when the table is loaded in, the values
are read as strings of characters, so rather than being treated as a number, each value is identified
as a category.) The following commands convert the columns:
> #Convert the numerical columns to a numeric datatype
> cancerdata$Rate =
+ as.numeric(levels(cancerdata$Rate)[as.numeric(cancerdata$Rate)])
> cancerdata$Population =
+ as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)])
> cancerdata$Number =
+ as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)])
> #Just check it worked...
> class(cancerdata$Number)
[1] "numeric"
> class(cancerdata$Rate)
[1] "numeric"
> class(cancerdata$Population)
[1] "numeric"
> head(cancerdata)
              Area  Rate Population Number
1 Shetland Islands 19.15      31332      6
2         Limavady 21.49      32573      7
3       Ballymoney 17.05      35191      6
4   Orkney Islands 29.87      36826     11
5            Larne 27.54      39942     11
6      Magherafelt 15.26      45872      7
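To see why the conversion goes via levels() rather than calling as.numeric() on the column directly, here is a minimal sketch using the first three Rate values from the table. The variable names (rate, codes, values, values_alt) are made up for illustration and are not part of the original script.

```r
# Three Rate values from the table, stored as a factor the way the scraper delivers them
rate <- factor(c("19.15", "21.49", "17.05"))

# as.numeric() on a factor returns the internal category codes, not the values
codes <- as.numeric(rate)

# Looking the codes up in the levels recovers the original text, which then converts cleanly
values <- as.numeric(levels(rate)[as.integer(rate)])

# An equivalent and arguably more readable route goes via character strings
values_alt <- as.numeric(as.character(rate))

print(codes)
print(values)
```

The codes reflect the sorted order of the categories, which is why a direct as.numeric() call would silently give the wrong numbers.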
We can now plot the data as a simple scatterplot using the plot command (figure 1) or we
can add a title to the graph and tweak the axis labels (figure 2).
The plot command is great for generating quick charts. If we want a bit more control over
the charts we produce, the ggplot2 library is the way to go. (ggplot2 isn’t part of the standard R
bundle, so you’ll need to install the package yourself if you haven’t already installed it. In RStudio,
find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its
dependencies...). You can see the sort of chart ggplot creates out of the box in figure 3.
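The commands behind those figures are not reproduced in these notes, so the following is only a sketch of the kind of calls involved. It assumes a small data frame (called sample_rows here, a made-up name) holding the six rows shown by head(cancerdata), and the title and axis labels are illustrative choices rather than the ones used in the original figures.

```r
# First six rows of the table, copied from the head(cancerdata) output
sample_rows <- data.frame(
  Area = c("Shetland Islands", "Limavady", "Ballymoney",
           "Orkney Islands", "Larne", "Magherafelt"),
  Rate = c(19.15, 21.49, 17.05, 29.87, 27.54, 15.26),
  Population = c(31332, 32573, 35191, 36826, 39942, 45872),
  Number = c(6, 7, 6, 11, 11, 7)
)

# Quick scatterplot with default title and axis labels
plot(sample_rows$Population, sample_rows$Rate)

# The same data with a title and tidier axis labels (labels are illustrative)
plot(sample_rows$Population, sample_rows$Rate,
     main = "Bowel cancer death rate by area",
     xlab = "Population", ylab = "Deaths per 100,000")

# ggplot2 equivalent, run only if the package is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  g <- ggplot(sample_rows, aes(x = Population, y = Rate)) + geom_point()
  print(g)
}
```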
1.2 Generating the Funnel Plot
Doing a bit of searching for the “funnel plot” chart type used to display the data in Goldacre’s
article, I came across a post on Cross Validated, the Stack Overflow/Stack Exchange site dedicated
to statistics-related Q&A: How to draw funnel plot using ggplot2 in R? (footnote 2)
The meta-analysis answer seemed to produce a similar chart type, so I had a go at cribbing
the code, with confidence limits set at the 95% and 99.9% levels. Note that I needed to do a few
things:
1. work out what values to use where! I did this by looking at the ggplot code to see what
was plotted. p was on the y-axis and should be used to present the death rate. The data
provides this as a rate per 100,000, so we need to divide by 100,000 to make it a rate in the
range 0..1. The x-axis is the population.
2. change the range and spacing of the samples used to create the curves
3. change the y-axis range.
You can see the result in figure 3.
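As a quick sanity check on the first step, the sketch below takes three rows from the table and confirms that dividing the rate by 100,000 gives a proportion which, multiplied by the population, recovers the recorded number of deaths. The vector names are made up for illustration; the standard error formula is the same one used in the funnel plot code.

```r
# Three rows copied from the head(cancerdata) output
rate_per_100k <- c(19.15, 21.49, 17.05)
population    <- c(31332, 32573, 35191)
deaths        <- c(6, 7, 6)

# Normalise the 'per 100,000' rate to a proportion in the range 0..1
p <- rate_per_100k / 100000

# Multiplying back by the population should give (approximately) the death counts
implied_deaths <- p * population

# Binomial standard error of each proportion, as in the funnel plot code
p_se <- sqrt(p * (1 - p) / population)

print(round(implied_deaths, 2))
print(p_se)
```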
2 http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
> #TH: funnel plot code from:
> #stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
> #TH: Use our cancerdata
> number=cancerdata$Population
> #TH: The rate is given as a 'per 100,000' value, so normalise it
> p=cancerdata$Rate/100000
> p.se <- sqrt((p*(1-p)) / (number))
> df <- data.frame(p, number, p.se, Area=cancerdata$Area)
> ## common effect (fixed effect model)
> p.fem <- weighted.mean(p, 1/p.se^2)
> ## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
> #TH: I'm going to alter the spacing of the samples used to generate the curves
> number.seq <- seq(1000, max(number), 1000)
> number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
> number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
> number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
> number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
> dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)
> ## draw plot
> #Load ggplot2 if it has not been loaded already
> library(ggplot2)
> #TH: note that we need to tweak the limits of the y-axis
> #linetype is set outside aes() so it is a fixed style rather than a mapped variable
> fp <- ggplot(aes(x = number, y = p), data = df) +
+ geom_point(shape = 1) +
+ geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +
+ geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +
+ geom_line(aes(x = number.seq, y = number.ll999), linetype = 2, data = dfCI) +
+ geom_line(aes(x = number.seq, y = number.ul999), linetype = 2, data = dfCI) +
+ geom_hline(aes(yintercept = p.fem), data = dfCI) +
+ xlab("Population") + ylab("Bowel cancer death rate") + theme_bw()
> #Automatically set the maximum y-axis value to be just a bit larger than the max data value
> fp=fp+scale_y_continuous(limits = c(0,1.1*max(p)))
> #Label the outlier point
> fp=fp+geom_text(aes(x = number, y = p,label=Area),size=3,data=subset(df,p>0.0003))
> print(fp)
[Figure: funnel plot of bowel cancer death rate (y-axis, 0.00000 to 0.00030) against population (x-axis, up to about 1,200,000), with the 95% and 99.9% limits drawn as curves. Glasgow City is the labelled outlier.]
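The shape of the funnel comes from the way the limits depend on population size. The sketch below isolates that calculation in a small helper (funnel_limits, a made-up name that is not part of the cribbed code) and shows where the 1.96 and 3.29 multipliers come from. The value of p0 is an illustrative placeholder, not the p.fem estimate from the full dataset.

```r
# Hypothetical helper: limits around a common rate p0 for population sizes n
funnel_limits <- function(p0, n, z) {
  half_width <- z * sqrt(p0 * (1 - p0) / n)
  data.frame(n = n, lower = p0 - half_width, upper = p0 + half_width)
}

# Normal quantiles behind the two-sided 95% and 99.9% limits
z95  <- qnorm(0.975)
z999 <- qnorm(0.9995)

# Illustrative placeholder for the common rate (NOT the estimate from the data)
p0 <- 0.0002

limits95 <- funnel_limits(p0, n = c(50000, 200000), z = z95)
width95  <- limits95$upper - limits95$lower

print(round(c(z95, z999), 2))
print(limits95)
```

Because the half width scales with one over the square root of the population, quadrupling the population halves the width of the band, which is what makes the limits narrow towards the right of the chart.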