1      Example of self-documenting data journalism notes
This is an example of using Sweave to combine code and output from the R statistical programming
environment with the LaTeX document processing environment, generating a self-documenting
script in which the actual code used to do stats and generate statistical graphics is displayed
alongside the charts it directly produces.
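
For orientation, here is a minimal sketch of what a Sweave source file can look like. It is an illustration only, not the source of these notes: the file name notes.Rnw and the chunk contents are assumptions. LaTeX text and R code chunks sit in the same file; each chunk opens with a << ... >>= line and closes with an @ on a line of its own.

```latex
\documentclass{article}
\begin{document}

Some ordinary LaTeX text introducing the analysis.

<<echo=T>>=
# An R code chunk: Sweave runs it and, with echo=T, shows the code with its output
x <- c(19.15, 21.49, 17.05)
mean(x)
@

<<fig=T, echo=T>>=
# fig=T asks Sweave to include the plot drawn by this chunk as a figure
plot(x)
@

\end{document}
```

Assuming the file is saved as notes.Rnw, running R CMD Sweave notes.Rnw should produce notes.tex, which can then be compiled with LaTeX in the usual way.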

1.1     Getting Started...
The aim is to try to replicate a graphic included by Ben Goldacre in his article "DIY statistical
analysis: experience the thrill of touching real data" (see footnote 1).

>   # The << echo = T >>= identifies an R code region;
>   # echo=T means the code itself is shown in the output, along with what it prints when it's run
>   # In the code area, lines beginning with a # are comment lines and are not executed
>
>   #First, we need to load in the XML library that contains the scraper function
>   library(XML)
>   #Now we scrape the table
>   srcURL='http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis'
>   cancerdata=data.frame(
+     readHTMLTable( srcURL, which=1, header=c('Area','Rate','Population','Number') ) )
>
>   #The @ symbol on its own at the start of a line marks the end of a code block

   The format is simple: readHTMLTable(url, which=TABLENUMBER), where TABLENUMBER selects
the N’th table in the page. The header argument labels the columns (the data pulled in from
the HTML table itself contains all sorts of clutter).
   We can inspect the data we’ve imported as follows:

>   #Look at the whole table (the whole table is quite long,
>   # so don't display it; comment out the command for now instead).
>   #cancerdata
>   #If you are using RStudio, you can inspect the data using the command: View(cancerdata)
>   #Look at the column headers
>   names(cancerdata)

[1] "Area"            "Rate"          "Population" "Number"

> #Look at the first 6 rows (head() shows 6 rows by default)
> head(cancerdata)

              Area Rate Population Number
1 Shetland Islands 19.15     31332      6
2         Limavady 21.49     32573      7
3       Ballymoney 17.05     35191      6
4   Orkney Islands 29.87     36826     11
5            Larne 27.54     39942     11
6      Magherafelt 15.26     45872      7

> #Look at the last 6 rows (tail() shows 6 rows by default)
> tail(cancerdata)

          Area  Rate Population Number
374  Wiltshire 18.69     727662    136
375  Sheffield  16.9     757396    128
376     Durham 17.29     786582    136
377      Leeds  17.3     959538    166
378   Cornwall 15.44    1062176    164
379 Birmingham 19.78    1268959    251

    Footnote 1: http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis

> #What sort of datatype is in the Number column?
> class(cancerdata$Number)

[1] "factor"

   The last line, class(cancerdata$Number), identifies the data as type factor. In order to
do stats and plot graphs, we need the Number, Rate and Population columns to contain actual
numbers. (Factors organise data according to categories. When the table is loaded in, the data is
loaded in as strings of characters, so rather than seeing each number as a number, R identifies
it as a category.) The next step converts these columns:

>   #Convert the numerical columns to a numeric datatype
>   cancerdata$Rate =
+     as.numeric(levels(cancerdata$Rate)[as.numeric(cancerdata$Rate)])
>   cancerdata$Population =
+     as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)])
>   cancerdata$Number =
+     as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)])
>   #Just check it worked
>   class(cancerdata$Number)

[1] "numeric"

> class(cancerdata$Rate)

[1] "numeric"

> class(cancerdata$Population)

[1] "numeric"

> head(cancerdata)

              Area Rate Population Number
1 Shetland Islands 19.15     31332      6
2         Limavady 21.49     32573      7
3       Ballymoney 17.05     35191      6
4   Orkney Islands 29.87     36826     11
5            Larne 27.54     39942     11
6      Magherafelt 15.26     45872      7
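
   To see why the conversion above goes via levels() rather than calling as.numeric() directly on the factor, here is a minimal sketch in base R. The rate vector is a made-up stand-in for a scraped column, not the real data.

```r
# Hypothetical stand-in for a column that was read in as a factor
rate <- factor(c("19.15", "21.49", "17.05", "19.15"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(rate)

# Idiom used above: look up each code in the level labels, then convert to numbers
via_levels <- as.numeric(levels(rate)[as.integer(rate)])

# An equivalent alternative that some find easier to read
via_character <- as.numeric(as.character(rate))

print(codes)
print(via_levels)
```

   The codes are positions in the sorted list of levels, which is why converting the factor directly would silently produce the wrong numbers.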

   We can now plot the data as a simple scatterplot using the plot command (figure 1) or we
can add a title to the graph and tweak the axis labels (figure 2).
   The plot command is great for generating quick charts. If we want a bit more control over
the charts we produce, the ggplot2 library is the way to go. (ggplot2 isn’t part of the standard R
bundle, so you’ll need to install the package yourself if you haven’t already installed it. In RStudio,
find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its
dependencies...). You can see the sort of chart ggplot creates out of the box in figure 3.
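
   If you prefer the console to the RStudio Packages tab, the same installation can be sketched in code. The helper name ensure_package is hypothetical (it is not part of R). It relies only on the base functions requireNamespace() and install.packages(), and the install argument exists just so that the installer can be swapped out.

```r
# Hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg, install = utils::install.packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install(pkg)  # by default, install.packages() also installs required dependencies
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Usage (commented out so that nothing is downloaded unintentionally):
# ensure_package("ggplot2")
# library(ggplot2)
```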


> #Plot the Number of deaths by the Population
> plot(Number ~ Population, data=cancerdata)

[Plot: Number (y-axis, 0 to 250) against Population (x-axis, 0 to 1200000)]

                                           Figure 1: Vanilla scatter plot

> #Plot the Number of deaths by the Population.
> #Add in a title (main) and tweak the y-axis label (ylab).
> plot(Number ~ Population, data=cancerdata,
+      main='Bowel Cancer Occurrence by Population', ylab='Number of deaths')



[Plot: Bowel Cancer Occurrence by Population; Number of deaths (y-axis, 0 to 250) against Population (x-axis, 0 to 1200000)]

                                           Figure 2: Vanilla scatter plot

>   require(ggplot2)
>   #Plot the Number of deaths by the Population
>   p=ggplot(cancerdata)+geom_point(aes(x=Population, y=Number))
>   print(p)


[Plot: Number (y-axis) against Population (x-axis), drawn with ggplot2]

                                           Figure 3: A rather prettier plot

1.2    Generating the Funnel Plot
Doing a bit of searching for the “funnel plot” chart type used to display the data in Goldacre’s
article, I came across a post on Cross Validated, the Stack Overflow/Stack Exchange site dedicated
to statistics-related Q&A: "How to draw funnel plot using ggplot2 in R?" (see footnote 2).
    The meta-analysis answer seemed to produce a similar chart type, so I had a go at cribbing
the code, with confidence limits set at the 95% and 99.9% levels. Note that I needed to do a couple
of things:

  1. work out what values to use where! I did this by looking at the ggplot code to see what
     was plotted. p was on the y-axis and should be used to present the death rate. The data
     provides this as a rate per 100,000, so we need to divide by 100,000 to make it a rate in the
     range 0..1. The x-axis is the population.
  2. change the range and width of samples used to create the curves
  3. change the y-axis range.

   You can see the result in the funnel plot at the end of this section.
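
   The arithmetic behind the steps above can be sketched in base R before turning to the full plotting code. This is an illustration under assumptions: funnel_limits is a hypothetical helper name, and the overall rate p_bar is a made-up value, not the one estimated from the data. The limits are the overall rate plus or minus z standard errors, using the binomial standard error sqrt(p(1-p)/n).

```r
# Hypothetical helper: control limits around an overall rate p_bar at population size n
funnel_limits <- function(p_bar, n, z = 1.96) {
  se <- sqrt(p_bar * (1 - p_bar) / n)   # binomial standard error
  data.frame(n = n, lower = p_bar - z * se, upper = p_bar + z * se)
}

# Step 1: a rate quoted per 100,000 becomes a proportion in the range 0..1
rate_per_100k <- 19.15
p <- rate_per_100k / 100000

# Illustrative overall rate (an assumed value, not the estimate from the data)
p_bar <- 0.00018

# The limits narrow as the population grows, which gives the funnel its shape
sizes <- c(50000, 200000, 800000)
limits95  <- funnel_limits(p_bar, n = sizes, z = 1.96)   # 95% limits
limits999 <- funnel_limits(p_bar, n = sizes, z = 3.29)   # 99.9% limits
print(limits95)
```

   An area whose rate falls outside the limits for its own population size is one worth a second look, although small populations are expected to scatter more widely.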




   Footnote 2: http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210

>   #TH: funnel plot code from:
>   #stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
>   #TH: Use our cancerdata
>   number=cancerdata$Population
>   #TH: The rate is given as a 'per 100,000' value, so normalise it
>   p=cancerdata$Rate/100000
>   p.se <- sqrt((p*(1-p)) / (number))
>   df <- data.frame(p, number, p.se, Area=cancerdata$Area)
>   ## common effect (fixed effect model)
>   p.fem <- weighted.mean(p, 1/p.se^2)
>   ## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
>   #TH: I'm going to alter the spacing of the samples used to generate the curves
>   number.seq <- seq(1000, max(number), 1000)
>   number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)
>   ## draw plot
>   #TH: note that we need to tweak the limits of the y-axis
>   fp <- ggplot(aes(x = number, y = p), data = df) +
+   geom_point(shape = 1) +
+   geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ll999), linetype = 2, data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ul999), linetype = 2, data = dfCI) +
+   geom_hline(aes(yintercept = p.fem), data = dfCI) +
+   xlab("Population") + ylab("Bowel cancer death rate") + theme_bw()
>   #Automatically set the maximum y-axis value to be just a bit larger than the max data value
>   fp=fp+scale_y_continuous(limits = c(0,1.1*max(p)))
>   #Label the outlier point
>   fp=fp+geom_text(aes(x = number, y = p,label=Area),size=3,data=subset(df,p>0.0003))
>   print(fp)




[Plot: funnel plot of Bowel cancer death rate (y-axis, 0.00000 to 0.00030) against Population (x-axis), with the outlier point labelled Glasgow City]
