Created
May 27, 2013 22:30
R Markdown of blog post on R minimal working examples (MWE).
How to Ask for Help using R
========================================================

The key to getting good help with an R problem is to provide a minimal working
reproducible example (MWRE). Making an MWRE is really easy with R, and it will
help ensure that those helping you can identify the source of the error and,
ideally, send you back corrected code instead of sending you hunting for code
that works. To have an MWRE you need the following items:

- a minimal dataset that produces the error
- the minimal runnable code necessary to reproduce the error, run on the dataset
  provided
- the necessary information on the packages used, the R version, and the system
- a `seed` value, if random properties are part of the code

Let's look at the tools available in R to help us create each of these components
quickly and easily.
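
To make the `seed` item concrete, here is a minimal sketch of how `set.seed()` makes random draws repeatable. The seed value 2013 and the variable names are arbitrary choices for illustration, not something the rest of this post depends on.

```r
# Any fixed integer works as a seed; 2013 is an arbitrary choice
set.seed(2013)
first_draw <- sample(1:100, 5)

# Resetting the same seed reproduces exactly the same "random" draw
set.seed(2013)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE
```

Anyone who runs your example with the same seed should see the same simulated values, provided their R version uses the same random number generator settings.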
### Producing a Minimal Dataset

There are three distinct options here:

1. Use a built-in R dataset
2. Create a new vector / data.frame from scratch
3. Output the data you are currently working on in a shareable way

Let's look at each of these in turn and see the tools R has to help us do this.

#### Built-in Datasets

There are a few canonical built-in R datasets that are really attractive for use in
help requests.

- mtcars
- diamonds (from ggplot2)
- iris

To see all the available datasets in R, simply type: `data()`. To load any of
these datasets, simply use the following:

```{r, comment=NA}
data(mtcars)
head(mtcars) # to look at the data
```

This option works great for a problem where you know you are having trouble with
a command in R. It is not a great option if you are having trouble understanding
why a command you are familiar with won't work on your data.

Note that for education data that is fairly "realistic", there are built-in
simulated datasets in the `eeptools` package, created by Jared Knowles.

```{r eeptoolsdemo, message=FALSE, warning=FALSE, comment=NA}
library(eeptools)
data(stulevel)
names(stulevel)
```

#### Create Your Own Data

Inputting data into R and sharing it back out with others is really easy. Part of
the power of R is the ability to create diverse data structures very easily.
Let's create a simulated data frame of student test scores and demographics.

```{r createdata, comment=NA}
Data <- data.frame(
  id = seq(1, 1000),
  gender = sample(c("male", "female"), 1000, replace = TRUE),
  mathSS = rnorm(1000, mean = 400, sd = 60),
  readSS = rnorm(1000, mean = 370, sd = 58.3),
  race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)
head(Data)
```

And, just like that, we have simulated student data. This is a great way to
evaluate problems with plotting data or with large datasets, since we can ask
R to generate a random dataset that is incredibly large if necessary. However,
let's look at the relationship among our variables using a quick plot:

```{r evalsimmeddata}
library(ggplot2)
# ggplot() is used here because qplot() is deprecated in recent ggplot2 releases
ggplot(Data, aes(x = mathSS, y = readSS, color = race)) +
  geom_point() +
  theme_bw()
```

It looks like race is pretty evenly distributed and there is no relationship
between `mathSS` and `readSS`. For some applications this data is sufficient, but
for others we may wish for data that is more realistic.

```{r evalsimmeddata2, comment=NA}
table(Data$race)
cor(Data$mathSS, Data$readSS)
```
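
If a more realistic relationship between the two scores is needed, one option is to build `readSS` partly from `mathSS`. The sketch below is only an illustration: the seed, the 0.6 weight, the noise level, and the name `SimData` are assumptions made for this example, not values taken from real test data.

```r
set.seed(1234)  # arbitrary seed so the simulated values can be reproduced
n <- 1000
mathSS <- rnorm(n, mean = 400, sd = 60)

# readSS depends partly on mathSS plus independent noise (weights are assumed)
readSS <- 370 + 0.6 * (mathSS - 400) + rnorm(n, mean = 0, sd = 45)

SimData <- data.frame(id = seq_len(n), mathSS = mathSS, readSS = readSS)

# The correlation is now clearly positive, unlike the independent draws above
cor(SimData$mathSS, SimData$readSS)
```

Changing the weight or the noise standard deviation makes the relationship stronger or weaker.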

#### Output Your Current Data

Sometimes you just want to show others the data you are using and see why
your code won't work on it. The best practice here is to make a subset of the data
you are working on, and then output it using the `dput` command.

```{r dataoutput, comment=NA}
dput(head(stulevel, 5))
```

The resulting code can be copied and pasted into an R terminal and it will
automatically rebuild the dataset exactly as described. Note that in the above
example, it might have been better if I had first cut out all the variables that
are unnecessary for my problem before I executed the `dput` command. The goal is
to share only the data necessary to reproduce your problem.

Also note that we never send **student-level** data from LDS over e-mail,
as this is insecure. For work on student-level data, it is better to either
simulate the data or to use the built-in simulated data from the `eeptools`
package to run your examples.
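
As a sketch of trimming the data before calling `dput`, the example below keeps only a few columns and rows and then checks that the printed code rebuilds an equivalent data frame. The built-in `mtcars` dataset stands in for your own data, and the chosen columns are arbitrary.

```r
# mtcars stands in for your own data; keep only the columns the problem needs
small <- head(mtcars[, c("mpg", "cyl", "wt")], 5)

# Print R code that recreates `small`; this is what you paste into a help request
dput(small)

# Check that the printed code really rebuilds an equivalent data frame
rebuilt <- eval(parse(text = capture.output(dput(small))))
all.equal(small, rebuilt)
```

The smaller the `dput` output, the easier it is for a helper to read the structure of your data at a glance.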

#### Anonymizing Your Data

It may also be the case that you want to `dput` your data, but you want to keep
the contents of your data anonymous. A Google search came up with a
decent-looking function to carry this out:

```{r anonymizedata, comment=NA}
anonym <- function(df) {
  # Extend LETTERS with doubled letters (AA, BB, ...) when there are more than 26 columns
  if (length(df) > 26) {
    LETTERS <- replicate(floor(length(df) / 26),
                         {LETTERS <- c(LETTERS, paste(LETTERS, LETTERS, sep = ""))})
  }
  names(df) <- paste(LETTERS[1:length(df)])
  level.id.df <- function(df) {
    level.id <- function(i) {
      if (is.factor(df[, i]) || is.character(df[, i])) {
        # Replace each category with "<column name>.<level number>"
        column <- paste(names(df)[i], as.numeric(as.factor(df[, i])), sep = ".")
      } else if (is.numeric(df[, i])) {
        # Rescale numeric columns by their mean
        column <- df[, i] / mean(df[, i], na.rm = TRUE)
      } else {
        column <- df[, i]
      }
      return(column)
    }
    DF <- data.frame(sapply(seq_along(df), level.id))
    names(DF) <- names(df)
    return(DF)
  }
  df <- level.id.df(df)
  return(df)
}
test <- anonym(stulevel)
head(test[, c(2:6, 28:32)])
```

That looks pretty generic and anonymized to me!

#### Notes

- Most of these solutions do not include missing data (NAs), which are often the
  source of problems in R. That limits their usefulness.
- So, always check for NA values.
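
One way to act on these notes is to add a few missing values to a simulated dataset and then check for them. The sketch below is an illustration only: the seed, the 20-row size, and the number of blanked-out values are arbitrary assumptions.

```r
set.seed(99)  # arbitrary seed
Data <- data.frame(
  id     = seq(1, 20),
  mathSS = rnorm(20, mean = 400, sd = 60),
  race   = sample(c("H", "B", "W", "I", "A"), 20, replace = TRUE)
)

# Blank out a few values at random positions
Data$mathSS[sample(nrow(Data), 3)] <- NA
Data$race[sample(nrow(Data), 2)] <- NA

anyNA(Data)           # TRUE if any value is missing
colSums(is.na(Data))  # number of missing values per column

mean(Data$mathSS)                # NA, because missing values propagate
mean(Data$mathSS, na.rm = TRUE)  # mean of the observed values only
```

If your error only appears when NAs are present, an example like this lets a helper see that immediately.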

### Creating the Example

Once we have our minimal dataset, we need to reproduce our error on *that dataset.*
This part is critical. If the error goes away when you apply your code to the
minimal dataset, then it will be very hard for others to diagnose the problem
remotely, and it might be time to get some "at your desk" help.

Let's look at an example where we have an error aggregating data. Let's assume
I am creating a new data frame for my example, and trying to aggregate that data
by race.

```{r aggregationproblems, comment=NA}
Data <- data.frame(
  id = seq(1, 1000),
  gender = sample(c("male", "female"), 1000, replace = TRUE),
  mathSS = rnorm(1000, mean = 400, sd = 60),
  readSS = rnorm(1000, mean = 370, sd = 58.3),
  race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)
myAgg <- Data[, list(meanM = mean(mathSS)), by = race]
head(myAgg)
```

Why do I get an error? If you sent the above code to someone, they could
quickly run it, see the error, and spot the mistake, provided they knew you were
attempting to use the `data.table` package: the package is never loaded and
`Data` is still a plain data frame.

```{r aggregationsolution, comment=NA, warning=FALSE}
library(data.table)
Data <- data.frame(
  id = seq(1, 1000),
  gender = sample(c("male", "female"), 1000, replace = TRUE),
  mathSS = rnorm(1000, mean = 400, sd = 60),
  readSS = rnorm(1000, mean = 370, sd = 58.3),
  race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)
Data <- data.table(Data)
myAgg <- Data[, list(meanM = mean(mathSS)), by = race]
head(myAgg)
```
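
When writing up a problem like this one, it can help to include the exact error message and the class of the object involved. The sketch below uses only base R and shows one possible way to capture both. The seed, the smaller 100-row dataset, and the name `msg` are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed
Data <- data.frame(
  race   = sample(c("H", "B", "W", "I", "A"), 100, replace = TRUE),
  mathSS = rnorm(100, mean = 400, sd = 60)
)

# A plain data.frame does not accept data.table's `by =` argument, so
# capture the error message instead of letting it stop the script
msg <- tryCatch(
  {
    Data[, list(meanM = mean(mathSS)), by = race]
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# The class of the object is a quick hint about what went wrong
class(Data)
inherits(Data, "data.table")
```

Pasting the captured message and the class alongside your code saves a helper from having to guess which package you intended to use.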

### Session Info

However, they might not know this, so we need to provide one final piece of
information. This is known as the `sessionInfo` for our R session. To diagnose
the error, it is necessary to know what system you are running on, what packages
are loaded in your workspace, and what versions of R and of each package you are
using.

Thankfully, R makes this incredibly easy. Just tack on the output from the
`sessionInfo()` function. This is easy enough to copy and paste or include in
a `knitr` document.

```{r sessioninfo, comment=NA}
sessionInfo()
```
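
If only part of this information is needed, or the output has to be pasted into an e-mail as plain text, a few base R calls cover the common cases. This is a minimal sketch: the `stats` package is used only because it ships with R, and `info` is a name chosen for this example.

```r
# The R version as a single string
R.version.string

# The installed version of one specific package
packageVersion("stats")

# The full session info as lines of plain text, ready to paste into a message
info <- capture.output(sessionInfo())
writeLines(info)
```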

### Resources

For more information, visit:

- [http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)
- [https://github.com/hadley/devtools/wiki/Reproducibility](https://github.com/hadley/devtools/wiki/Reproducibility)
- [http://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l/10458688#10458688](http://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l/10458688#10458688)