jknowles · May 27, 2013 22:30
diff --git a/MWEblogpost.Rmd b/MWEblogpost.Rmd
 How to Ask for Help using R
 ========================================================

 The key to getting good help with an R problem is to provide a minimal working 
 reproducible example (MWRE). Making an MWRE is easy with R, and it helps ensure 
 that those helping you can identify the source of the error and, ideally, send 
 you back corrected code instead of sending you hunting for code that works. 
 An MWRE needs the following items:

 - a minimal dataset that produces the error
 - the minimal runnable code necessary to reproduce the error, run on the dataset
 provided
 - the necessary information on the packages used, the R version, and the system
 - a `seed` value, if random properties are part of the code

 Let's look at the tools available in R to help us create each of these components
 quickly and easily. 

 ### Producing a Minimal Dataset

 There are three distinct options here:

 1. Use a built in R dataset
 2. Create a new vector / data.frame from scratch
 3. Output the data you are currently working on in a shareable way

 Let's look at each of these in turn and see the tools R has to help us do this. 

 #### Built in Datasets

 There are a few canonical built-in R datasets that are really attractive for use 
 in help requests. 

 - mtcars
 - diamonds (from ggplot2)
 - iris

 To see all the available datasets in R, simply type: `data()`. To load any of 
 these datasets, simply use the following:

 ```{r, comment=NA}
 data(mtcars)
 head(mtcars) # to look at the data
 ```

 This option works great for a problem where you know you are having trouble with
 a command in R. It is not a great option if you are having trouble understanding
 why a command you are familiar with won't work on your data. 
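
 Even with a built-in dataset, it helps to trim the data down to what the 
 question needs. Here is a minimal sketch, assuming the problem only involves 
 three columns of `mtcars`; the object name `small_cars` is made up for 
 illustration.

 ```{r trimbuiltin, comment=NA}
 # Keep only the first few rows and the columns that matter for the question
 small_cars <- head(mtcars[, c("mpg", "cyl", "wt")], 6)
 small_cars
 ```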

 Note that for education data that is fairly "realistic", there are built in 
 simulated datasets in the `eeptools` package, created by Jared Knowles.

 ```{r eeptoolsdemo, message=FALSE, warning=FALSE, comment=NA}
 library(eeptools)
 data(stulevel)
 names(stulevel)
 ```

 #### Create Your Own Data

 Inputting data into R and sharing it back out with others is really easy. Part of
 the power of R is the ability to create diverse data structures very easily. 
 Let's create a simulated data frame of student test scores and demographics.

 ```{r createdata, comment=NA}
 Data <- data.frame(
    id     = seq(1, 1000),
    gender = sample(c("male", "female"), 1000, replace = TRUE),
    mathSS = rnorm(1000, mean = 400, sd = 60),
    readSS = rnorm(1000, mean= 370, sd = 58.3),
    race   = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
 )

 head(Data)
 ```

 And, just like that, we have simulated student data. This is a great way to 
 evaluate problems with plotting data or with large datasets, since we can ask
 R to generate a random dataset that is incredibly large if necessary. However, 
 let's look at the relationship among our variables using a quick plot:

 ```{r evalsimmeddata}
 library(ggplot2)
 ggplot(Data, aes(x = mathSS, y = readSS, color = race)) + geom_point() + theme_bw()
 ```

 It looks like race is pretty evenly distributed and there is no relationship 
 between `mathSS` and `readSS`. For some applications this data is sufficient, 
 but for others we may wish for data that is more realistic.

 ```{r evalsimmeddata2, comment=NA}
 table(Data$race)
 cor(Data$mathSS, Data$readSS)
 ```
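
 If more realistic data is needed, one option is to build the correlation in 
 directly and fix the seed so that everyone gets the same draw. The sketch below 
 is only one way to do this: the helper `simulate_scores()`, the correlation of 
 0.7, and the seed value are all assumptions chosen for illustration.

 ```{r simcorrelated, comment=NA}
 # Hypothetical helper: simulate correlated math and reading scale scores
 simulate_scores <- function(n = 1000, rho = 0.7, seed = 2013) {
   set.seed(seed)                     # same seed, same "random" data
   z_math <- rnorm(n)
   # Mix the math draw with fresh noise so the two scores correlate at about rho
   z_read <- rho * z_math + sqrt(1 - rho^2) * rnorm(n)
   data.frame(
     id     = seq_len(n),
     mathSS = 400 + 60 * z_math,
     readSS = 370 + 58.3 * z_read
   )
 }

 Data2 <- simulate_scores()
 cor(Data2$mathSS, Data2$readSS)
 identical(Data2, simulate_scores())  # TRUE: the seed makes the data reproducible
 ```

 Note that calling `set.seed()` inside a function resets the random number 
 stream for the whole session, which is usually fine in a short example.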


 #### Output Your Current Data

 Sometimes you just want to show others the data you are using so they can see 
 why your code won't work. The best practice here is to make a subset of the data 
 you are working on, and then output it using the `dput` command. 

 ```{r dataoutput, comment=NA}
 dput(head(stulevel, 5))

 ```

 The resulting code can be copied and pasted into an R terminal, where it will 
 rebuild the dataset exactly as described. Note that in the above example it 
 would have been better to cut out all the variables unnecessary for my problem 
 before executing the `dput` command. The goal is to make available only the 
 data necessary to reproduce your problem. 
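
 As a rough sketch of that trimming step, here is the same idea on the built-in 
 `iris` data (used here only so the example runs without extra packages). The 
 object names `small` and `rebuilt` are made up for illustration.

 ```{r dputtrimmed, comment=NA}
 # Keep two columns and five rows, then drop factor levels that no longer occur
 small <- droplevels(head(iris[iris$Species != "setosa",
                               c("Sepal.Length", "Species")], 5))
 dput(small)

 # Round trip: parse the printed text back into an object, as a helper would
 rebuilt <- eval(parse(text = capture.output(dput(small))))
 all.equal(small, rebuilt)
 ```

 Calling `droplevels()` before `dput` keeps unused factor levels out of the 
 output, which makes the pasted code shorter.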

 Also note that we never send **student level** data from LDS over e-mail, 
 as this is insecure. For work on student level data, it is better either to 
 simulate the data or to use the built-in simulated data from the `eeptools` 
 package to run your examples. 

 #### Anonymizing Your Data

 It may also be the case that you want to `dput` your data, but you want to keep
 the contents of your data anonymous. A Google search came up with a 
 decent-looking function to carry this out:

 ```{r anonymizedata, comment=NA}
 anonym <- function(df) {
   # Build enough unique column labels: A..Z, then AA, AB, ..., ZZ (702 in total)
   labels <- c(LETTERS, paste0(rep(LETTERS, each = 26), LETTERS))
   stopifnot(length(df) <= length(labels))
   names(df) <- labels[seq_along(df)]

   # Recode one column: categories become "<column>.<level number>",
   # numeric values are rescaled by the column mean, anything else is kept
   level.id <- function(i) {
     x <- df[[i]]
     if (is.factor(x) || is.character(x)) {
       paste(names(df)[i], as.numeric(as.factor(x)), sep = ".")
     } else if (is.numeric(x)) {
       x / mean(x, na.rm = TRUE)
     } else {
       x
     }
   }

   # lapply (not sapply) keeps each column's type instead of coercing to character
   DF <- lapply(seq_along(df), level.id)
   names(DF) <- names(df)
   as.data.frame(DF, stringsAsFactors = FALSE)
 }

 test <- anonym(stulevel)
 head(test[, c(2:6, 28:32)])
 ```

 That looks pretty generic and anonymized to me!

 #### Notes

 - Most of these approaches produce data without missing values (NAs), which are 
 often the source of problems in R. That limits their usefulness.
 - So always check whether your real data contain NA values, and include some in 
 your example if they do.
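
 One way to act on this is to add a few missing values to simulated data and 
 check how the code behaves. The sketch below assumes a small made-up data frame 
 called `scores`; the seed and the number of missing values are arbitrary.

 ```{r simulatemissing, comment=NA}
 set.seed(42)
 scores <- data.frame(id = 1:20, mathSS = rnorm(20, mean = 400, sd = 60))

 # Blank out four randomly chosen scores to mimic incomplete records
 scores$mathSS[sample(nrow(scores), 4)] <- NA

 colSums(is.na(scores))             # count of NAs per column
 mean(scores$mathSS)                # NA, because missing values propagate
 mean(scores$mathSS, na.rm = TRUE)  # mean of the observed scores only
 ```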

 ### Creating the Example

 Once we have our minimal dataset, we need to reproduce our error on *that dataset.*
 This part is critical. If the error goes away when you apply your code to the 
 minimal dataset, then it will be very hard for others to diagnose the problem 
 remotely, and it might be time to get some "at your desk" help. 

 Let's look at an example where we have an error aggregating data. Let's assume
 I am creating a new data frame for my example, and trying to aggregate that data
 by race. 

 ```{r aggregationproblems, comment=NA}
 Data <- data.frame(
    id     = seq(1, 1000),
    gender = sample(c("male", "female"), 1000, replace = TRUE),
    mathSS = rnorm(1000, mean = 400, sd = 60),
    readSS = rnorm(1000, mean= 370, sd = 58.3),
    race   = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
 )

 myAgg <- Data[, list(meanM = mean(mathSS)), by= race]
 head(myAgg)
 ```

 Why do I get an error? If you sent the above code to someone, they could 
 quickly run it and spot the mistake, provided they knew you were attempting to 
 use the `data.table` package. 

 ```{r aggregationsolution, comment=NA, warning=FALSE}
 library(data.table)
 Data <- data.frame(
    id     = seq(1, 1000),
    gender = sample(c("male", "female"), 1000, replace = TRUE),
    mathSS = rnorm(1000, mean = 400, sd = 60),
    readSS = rnorm(1000, mean= 370, sd = 58.3),
    race   = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
 )

 Data <- data.table(Data)
 myAgg <- Data[, list(meanM = mean(mathSS)), by= race]
 head(myAgg)
 ```
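
 When asking for help, it is also useful to include the exact error text and the 
 class of the object involved. Here is a small sketch using base R only; the 
 four-row data frame `scores` is made up for illustration.

 ```{r captureerror, comment=NA}
 scores <- data.frame(race = c("H", "B", "H", "W"), mathSS = c(410, 395, 430, 388))

 # Run the failing call inside tryCatch() so the error text is kept as a string
 msg <- tryCatch(
   scores[, list(meanM = mean(mathSS)), by = race],
   error = function(e) conditionMessage(e)
 )
 msg
 class(scores)  # "data.frame": the data.table syntax does not apply to it
 ```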

 ### Session Info

 However, they might not know this, so we need to provide one final piece of 
 information: the `sessionInfo` for our R session. To diagnose the error it is 
 necessary to know what system you are running on, which packages are loaded in 
 your session, and which versions of R and of each package you are using.

 Thankfully, R makes this incredibly easy. Just tack on the output from the
 `sessionInfo()` function. This is easy enough to copy and paste or include in 
 a `knitr` document. 

 ```{r sessioninfo, comment=NA}
 sessionInfo()
 ```
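
 If the help request goes out by e-mail or a forum post rather than a `knitr` 
 document, the same information can be saved as plain text. This is a minimal 
 sketch; the file name and the use of a temporary directory are assumptions.

 ```{r savesessioninfo, comment=NA}
 # Save the session details to a text file that can be attached to a help request
 info_file <- file.path(tempdir(), "sessionInfo.txt")
 writeLines(capture.output(sessionInfo()), info_file)

 R.version.string                        # the R version as one line
 as.character(packageVersion("stats"))   # version of a single package
 ```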


 ### Resources

 For more information, visit:

 - [http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)
 - [https://github.com/hadley/devtools/wiki/Reproducibility](https://github.com/hadley/devtools/wiki/Reproducibility)
 - [http://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l/10458688#10458688](http://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l/10458688#10458688)