|
% Reproducible Parallel Simulations with Harvestr |
|
% \href{mailto:[email protected]}{Andrew Redd} |
|
% UseR! 2012 |
|
|
|
`ro dev='pdf', cache=T, warning=T, error=T, out.width="\\textwidth" |
|
, fig.width=4, fig.height=3 or` |
|
```{r ex1_0, include=F} |
|
library(harvestr) |
|
library(plyr) |
|
library(ggplot2) |
|
library(lme4) |
|
library(dostats) |
|
library(doParallel) |
|
options(harvestr.time=F) |
|
``` |
|
```{r ex3_parallel_setup, include=F, dependson='ex3', cache=F} |
|
library(parallel) |
|
library(doParallel) |
|
cl <- makeCluster(2) |
|
clusterEvalQ(cl, { |
|
library(lme4) |
|
library(harvestr) |
|
}) |
|
clusterExport(cl, ls()) |
|
registerDoParallel(cl) |
|
``` |
|
|
|
\includegraphics[width=\textwidth]{./Bill Gates lazy person.jpg) |
|
|
|
|
|
|
|
# Reproducible Simulation |
|
|
|
## Reproducibility |
|
|
|
* Same script |
|
* Same results |
|
* Anywhere |
|
+ Single thread |
|
+ Multi-core |
|
+ Cloud Scale |
|
|
|
## Everything starts with a seed. |
|
Simulation is based off Pseudo-random number generation (PRNG). |
|
|
|
* PRNG is sequential, next number depends on the last state. |
|
* Seeds are used to store the state of a random number generator |
|
* by 'Setting a seed' one can place a PRNG into any exact state. |
|
|
|
## Parallel Random Number Generation |
|
Simulation is complicated in new parallel environments. |
|
|
|
* PRNG is sequential, |
|
* parallel execution is not, |
|
* and order of execution is not guaranteed. |
|
|
|
This is where parallel pseudo-random number generators help out. |
|
|
|
## Parallel PRNG |
|
Parallel pseudo-random number generators start with a singe state that |
|
can spawn additional streams as well as streams of random numbers. |
|
|
|
1. SPRNG |
|
2. L'Ecuyer combined multiple-recursive generator |
|
|
|
# Introducing `harvestr` |
|
|
|
## R package `harvestr` |
|
<https://github.com/halpo/harvestr> |
|
|
|
What `harvestr` does: |
|
|
|
* Reproducibility |
|
* Caching |
|
* Under parallelized environments. |
|
|
|
|
|
|
|
## How `harvestr` works |
|
|
|
* Analytical elements are separated into work-flows of dependent elements. |
|
+ Set up environment/seed |
|
+ Generate Data |
|
+ Perform analysis |
|
- Stochastic |
|
- Non-Stochastic |
|
+ Summarize |
|
* Results from one step carry to another by carrying the seed with the results. |
|
|
|
|
|
## **Primary work-flow** for `harvestr` |
|
|
|
* `gather(n)` - generate `n` random number streams. |
|
* `farm(seeds, expr)` - evaluate `expr` with each seed in `seeds`. |
|
* `harvest(x, fun)` - for each data in `x` call the function `fun` |
|
(based off `plyr`s `llply`). |
|
|
|
\includegraphics[width=\textwidth]{./c0bfeebb.pdf} |
|
|
|
## Example - Simple simulation |
|
|
|
**Generate Data** |
|
|
|
```{r ex1_1} |
|
seeds <- gather(10, seed=20120614) |
|
data <- farm(seeds, { |
|
x1 <- rnorm(400) |
|
x2 <- rnorm(400) |
|
g <- rep(rnorm(4), each=100) |
|
trt <- rep(1:4, each=100) |
|
y <- rnorm(n=400, mean=3*x1 + x2 + g) |
|
data.frame(y, x1, x2, trt) |
|
}) |
|
``` |
|
|
|
------- |
|
|
|
**Perform Analysis** |
|
|
|
```{r ex1_2, dependson='ex1_1'} |
|
analyses <- harvest(data, lmer, formula = y~x1+(1|trt)) |
|
analyses[[1]] |
|
``` |
|
|
|
------- |
|
|
|
**Recombine** |
|
|
|
```{r ex1_3, dependson='ex1_2'} |
|
results <- ldply(analyses, fixef) |
|
qplot(data=results[,1:2], x=x1, y=`(Intercept)`) |
|
``` |
|
|
|
## Stochastic Analysis in `harvestr` |
|
|
|
* `gather` then `farm` as before. |
|
* `graft` to generate seeds |
|
|
|
\includegraphics[width=\textwidth]{./b837b118.pdf} |
|
|
|
## Example 2 - Stochastic Analysis |
|
**graft to obtain independent RNG sub-streams** |
|
|
|
```{r ex2_1, dependson='ex1_2'} |
|
branch <- graft(analyses[[1]], 5) |
|
chains <- harvest(branch, mcmcsamp, 100) |
|
mcmcfx <- harvest(chains, slot, 'fixef') |
|
names(mcmcfx) <- 1:5 |
|
df2 <- ldply(llply(mcmcfx, t), as.data.frame) |
|
``` |
|
|
|
------ |
|
|
|
```{r ex2_2, dependson='ex2_1'} |
|
qplot(data=df2, x1, col=.id, geom=c('density')) |
|
``` |
|
|
|
## Example 3 Chained. |
|
|
|
```{r ex3, tidy=F, dependson='ex1_2'} |
|
branches <- harvest(analyses, graft, 5) |
|
chains <- harvest(branches |
|
, harvest, mcmcsamp, n=100) |
|
``` |
|
|
|
I'm really impatient and would like to do this in parallel. |
|
|
|
## parallel |
|
|
|
Just like `plyr` argument `.parallel`. |
|
|
|
* uses [`plyr`](http://cran.r-project.org/package=plyr) and |
|
[`foreach`](http://cran.r-project.org/package=foreach) parallel structures. |
|
|
|
```{r ex3_parallel, warning=F, dependson='ex3.parallel.setup', tidy=F} |
|
#! Parallel environment already setup. |
|
system.time(chains <- |
|
harvest(branches, harvest, mcmcsamp, n=100, .parallel=TRUE) |
|
) |
|
``` |
|
|
|
## Caching |
|
|
|
Results can be made fault tolerant or interruptible by including caching. |
|
|
|
Caching in `harvestr` indexes on |
|
|
|
* data |
|
* function |
|
* seed |
|
|
|
using the `digest` function. |
|
|
|
----- |
|
```{r rm_cache, include=F} |
|
unlink("harvestr-cache", T, T) |
|
``` |
|
|
|
```{r caching, dependson='rm_cache'} |
|
seeds <- gather(100) |
|
system.time(farm(seeds, tail(rnorm(5e5)), cache=T)) |
|
system.time(farm(seeds, tail(rnorm(5e5)), cache=T)) |
|
head(dir('harvestr-cache')) |
|
``` |
|
|
|
## So is it really reproducible? |
|
|
|
```{r ex5, warning=F} |
|
set.seed(123) |
|
seeds <- gather(4) |
|
a <- farm(seeds, runif(5)) |
|
b <- farm(seeds, runif(5)) |
|
c <- farm(seeds, runif(5), .parallel=T) |
|
identical(a, b) |
|
identical(a, c) |
|
``` |
|
|
|
|
|
# Miscellaneous Extras |
|
|
|
## Building blocks |
|
Some building blocks that might *might* be helpful. |
|
|
|
* `plant`- for setting up copies of an object with given seeds. |
|
* `sprout` - for obtaining the sub-streams used with graft. |
|
* `reap` - single object version of `harvest` |
|
|
|
|
|
## In case you are wondering |
|
|
|
* Yes it works with `Rcpp` code, |
|
+ provided the compiled code uses the RNGScope for RNG in C++. |
|
* **But** take care to not carry C++ reference objects across parallel calls. |
|
|
|
|
|
------ |
|
|
|
\titlepage |
|
|
|
```{r stop_cl, include=F} |
|
stopCluster(cl) |
|
``` |
This is a great template. Am having problems with:
Error in try(fun(x, ...), getOption("harvestr.try.silent", FALSE)) :
object 'mcmcsamp' not found
both harvestr and lme4 are installed...