Created
November 1, 2011 10:54
-
-
Save psychemedia/1330309 to your computer and use it in GitHub Desktop.
Sweave script documenting the production of a funnel plot
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
\documentclass[a4paper]{article} | |
\SweaveOpts{echo=FALSE, keep.source=TRUE} | |
\usepackage{a4wide} | |
\usepackage{color} | |
\usepackage{hyperref} | |
\begin{document} | |
\section{Example of self-documenting data journalism notes} | |
This is an example of using Sweave to combine code and output from the R statistical programming environment and the LaTeX document processing environment to generate a self-documenting script in which the actual code used to do stats and generate statistical graphics is displayed along the charts it directly produces. | |
\subsection{Getting Started...} | |
The aim is to try to replicate a graphic included by Ben Goldacre in his article \emph{DIY statistical analysis: experience the thrill of touching real data}\footnote{\url{http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis}}. | |
<< echo = T >>= | |
# The << echo = T >>= identifies an R code region; | |
# echo=T means run the code, and print what happens when it's run | |
# In the code area, lines beginning with a # are comment lines and are not executed | |
#First, we need to load in the XML library that contains the scraper function | |
library(XML) | |
#Now we scrape the table | |
srcURL='http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis' | |
cancerdata=data.frame( | |
readHTMLTable( srcURL, which=1, header=c('Area','Rate','Population','Number') ) ) | |
#The @ symbol on its own at the start of a line marks the end of a code block | |
@ | |
The format is simple: {\tt readHTMLTable(url,which=TABLENUMBER)} ({\tt TABLENUMBER} is used to extract the N'th table in the page.) The header part labels the columns (the data pulled in from the HTML table itself contains all sorts of clutter). | |
We can inspect the data we've imported as follows: | |
<< echo = T >>= | |
#Look at the whole table (the whole table is quite long, | |
# so don't disply it/comment out the command for now instead. | |
#cancerdata | |
#If you are using RStudio, you can inspect the data using the command: View(cancerdata)) | |
#Look at the column headers | |
names(cancerdata) | |
#If there's a problem with the header command not doing it's stuff, | |
# uncomment the next two lines to fix it and check the fix... | |
#names(cancerdata)<-c('Area','Rate','Population','Number') | |
#names(cancerdata) | |
#Look at the first 10 rows | |
head(cancerdata) | |
#Look at the last 10 rows | |
tail(cancerdata) | |
#What sort of datatype is in the Number column? | |
class(cancerdata$Number) | |
@ | |
The last line, {\tt class(cancerdata\$Number)}, identifies the data as type \emph{factor}. In order to do stats and plot graphs, we need the Number, Rate and Population columns to contain actual numbers. (Factors organise data according to categories; when the table is loaded in, the data is loaded in as strings of characters; rather than seeing each number as a number, it's identified as a category.) The | |
<< echo=T >>= | |
#Convert the numerical columns to a numeric datatype | |
cancerdata$Rate = | |
as.numeric(levels(cancerdata$Rate)[as.numeric(cancerdata$Rate)]) | |
cancerdata$Population = | |
as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)]) | |
cancerdata$Number = | |
as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)]) | |
#Just check it worked… | |
class(cancerdata$Number) | |
class(cancerdata$Rate) | |
class(cancerdata$Population) | |
head(cancerdata) | |
@ | |
We can now plot the data as a simple scatterplot using the {\tt plot} command (figure \ref{fig:simpleplot}) or we can add a title to the graph and tweak the axis labels (figure \ref{fig:simpleplot2}). | |
\begin{figure} | |
\begin{center} | |
<<label=fig_simpleplot,fig=TRUE,echo=T>>= | |
#Plot the Number of deaths by the Population | |
plot(Number ~ Population, data=cancerdata) | |
@ | |
\end{center} | |
\caption{Vanilla scatter plot} | |
\label{fig:simpleplot} | |
\end{figure} | |
\begin{figure} | |
\begin{center} | |
<<label=fig_simpleplot2,fig=TRUE,echo=T>>= | |
#Plot the Number of deaths by the Population. | |
#Add in a title (main) and tweak the y-axis label (ylab). | |
plot(Number ~ Population, data=cancerdata, | |
main='Bowel Cancer Occurrence by Population', ylab='Number of deaths') | |
@ | |
\end{center} | |
\caption{Vanilla scatter plot} | |
\label{fig:simpleplot2} | |
\end{figure} | |
The {\tt plot} command is great for generating quick charts. If we want a bit more control over the charts we produce, the {\tt ggplot2} library is the way to go. (ggplot2 isn't part of the standard R bundle, so you'll need to install the package yourself if you haven't already installed it. In RStudio, find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its dependencies...). You can see the sort of chart ggplot creates out of the box in figure \ref{fig:simpleggplot}. | |
\begin{figure} | |
\begin{center} | |
<<label=fig_simpleggplot,fig=TRUE,echo=T>>= | |
require(ggplot2) | |
#Plot the Number of deaths by the Population | |
p=ggplot(cancerdata)+geom_point(aes(x=Population, y=Number)) | |
print(p) | |
@ | |
\end{center} | |
\caption{A rather prettier plot} | |
\label{fig:simpleggplot} | |
\end{figure} | |
\newpage | |
\subsection{Generating the Funnel Plot} | |
Doing a bit of searching for the ``funnel plot'' chart type used to display the data in Goldacre's article, I came across a post on Cross Validated, the Stack Overflow/Stack Exchange site dedicated to statistics related Q\&A: \emph{How to draw funnel plot using ggplot2 in R?}\footnote{\url{http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210\#5210}} | |
The meta-analysis answer seemed to produce the similar chart type, so I had a go at cribbing the code, with confidence limits set at the 95\% and 99.9\% levels. Note that I needed to do a couple of things: | |
\begin{enumerate} | |
\item work out what values to use where! I did this by looking at the ggplot code to see what was plotted. p was on the y-axis and should be used to present the death rate. The data provides this as a rate per 100,000, so we need to divide by 100, 000 to make it a rate in the range 0..1. The x-axis is the population. | |
\item change the range and width of samples used to create the curves | |
\item change the y-axis range. | |
\end{enumerate} | |
You can see the result in figure \ref{fig:simpleggplot}. | |
\begin{figure} | |
\begin{center} | |
<<label=fig_funnelplot,fig=TRUE,echo=T>>= | |
#TH: funnel plot code from: | |
#stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210 | |
#TH: Use our cancerdata | |
number=cancerdata$Population | |
#TH: The rate is given as a 'per 100,000' value, so normalise it | |
p=cancerdata$Rate/100000 | |
p.se <- sqrt((p*(1-p)) / (number)) | |
df <- data.frame(p, number, p.se, Area=cancerdata$Area) | |
## common effect (fixed effect model) | |
p.fem <- weighted.mean(p, 1/p.se^2) | |
## lower and upper limits for 95% and 99.9% CI, based on FEM estimator | |
#TH: I'm going to alter the spacing of the samples used to generate the curves | |
number.seq <- seq(1000, max(number), 1000) | |
number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq)) | |
number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq)) | |
number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq)) | |
number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq)) | |
dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem) | |
## draw plot | |
#TH: note that we need to tweak the limits of the y-axis | |
fp <- ggplot(aes(x = number, y = p), data = df) + | |
geom_point(shape = 1) + | |
geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) + | |
geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) + | |
geom_line(aes(x = number.seq, y = number.ll999, linetype = 2), data = dfCI) + | |
geom_line(aes(x = number.seq, y = number.ul999, linetype = 2), data = dfCI) + | |
geom_hline(aes(yintercept = p.fem), data = dfCI) + | |
xlab("Population") + ylab("Bowel cancer death rate") + theme_bw() | |
#Automatically set the maximum y-axis value to be just a bit larger than the max data value | |
fp=fp+scale_y_continuous(limits = c(0,1.1*max(p))) | |
#Label the outlier point | |
fp=fp+geom_text(aes(x = number, y = p,label=Area),size=3,data=subset(df,p>0.0003)) | |
print(fp) | |
@ | |
\end{center} | |
\caption{A rather prettier plot} | |
\label{fig:funnelplot} | |
\end{figure} | |
\end{document} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment