raffdoc · June 19, 2012 20:12
diff --git a/000-instructions.md b/000-instructions.md
diff --git a/harvestr.Rmd b/harvestr.Rmd
 % Reproducible Parallel Simulations with Harvestr
 % \href{mailto:[email protected]}{Andrew Redd}
 % UseR! 2012

 `ro dev='pdf', cache=T, warning=T, error=T, out.width="\\textwidth"
    , fig.width=4, fig.height=3 or`
 ```{r ex1_0, include=F}
 library(harvestr)
 library(plyr)
 library(ggplot2)
 library(lme4)
 library(dostats)
 library(doParallel)
 options(harvestr.time=F)
 ```
 ```{r ex3_parallel_setup, include=F, dependson='ex3', cache=F}
 library(parallel)
 library(doParallel)
 cl <- makeCluster(2)
 clusterEvalQ(cl, {
    library(lme4)
    library(harvestr)
 })
 clusterExport(cl, ls())
 registerDoParallel(cl)
 ```

 \includegraphics[width=\textwidth]{./Bill Gates lazy person.jpg)



 # Reproducible Simulation

 ## Reproducibility 

 * Same script
 * Same results
 * Anywhere
    + Single thread
    + Multi-core
    + Cloud Scale

 ## Everything starts with a seed.
 Simulation is based off Pseudo-random number generation (PRNG).

 * PRNG is sequential, next number depends on the last state.
 * Seeds are used to store the state of a random number generator
 * by 'Setting a seed' one can place a PRNG into any exact state.

 ## Parallel Random Number Generation
 Simulation is complicated in new parallel environments.
 
 * PRNG is sequential,
 * parallel execution is not,
 * and order of execution is not guaranteed.

 This is where parallel pseudo-random number generators help out.

 ## Parallel PRNG
 Parallel pseudo-random number generators start with a singe state that
 can spawn additional streams as well as streams of random numbers.

 1. SPRNG
 2. L'Ecuyer combined multiple-recursive generator

 # Introducing `harvestr`

 ## R package `harvestr`
 <https://github.com/halpo/harvestr>

 What `harvestr` does:

 * Reproducibility
 * Caching
 * Under parallelized environments.



 ## How `harvestr` works

 * Analytical elements are separated into work-flows of dependent elements.
    + Set up environment/seed
    + Generate Data
    + Perform analysis
        - Stochastic
        - Non-Stochastic
    + Summarize
 * Results from one step carry to another by carrying the seed with the results.


 ## **Primary work-flow** for `harvestr`

 * `gather(n)` - generate `n` random number streams.
 * `farm(seeds, expr)` - evaluate `expr` with each seed in `seeds`.
 * `harvest(x, fun)` - for each data in `x` call the function `fun` 
  (based off `plyr`s `llply`).

 \includegraphics[width=\textwidth]{./c0bfeebb.pdf}

 ## Example - Simple simulation

 **Generate Data**

 ```{r ex1_1}
 seeds <- gather(10, seed=20120614)
 data  <- farm(seeds, {
    x1  <- rnorm(400)
    x2  <- rnorm(400)
    g   <- rep(rnorm(4), each=100)
    trt <- rep(1:4, each=100)
    y <- rnorm(n=400, mean=3*x1 + x2 + g)
    data.frame(y, x1, x2, trt)
 }) 
 ```

 -------

 **Perform Analysis**

 ```{r  ex1_2, dependson='ex1_1'}
 analyses <- harvest(data, lmer, formula = y~x1+(1|trt))
 analyses[[1]]
 ```

 -------

 **Recombine**

 ```{r ex1_3, dependson='ex1_2'}
  results <- ldply(analyses, fixef)
  qplot(data=results[,1:2], x=x1, y=`(Intercept)`)
 ```

 ## Stochastic Analysis in `harvestr`

 * `gather` then `farm` as before.
 * `graft` to generate seeds

 \includegraphics[width=\textwidth]{./b837b118.pdf}

 ## Example 2 - Stochastic Analysis
 **graft to obtain independent RNG sub-streams**

 ```{r ex2_1, dependson='ex1_2'}
 branch <- graft(analyses[[1]], 5)
 chains <- harvest(branch, mcmcsamp, 100)
 mcmcfx <- harvest(chains, slot, 'fixef')
 names(mcmcfx) <- 1:5
 df2 <- ldply(llply(mcmcfx, t), as.data.frame)
 ```

 ------

 ```{r ex2_2, dependson='ex2_1'}
 qplot(data=df2, x1, col=.id, geom=c('density'))
 ```

 ## Example 3 Chained.

 ```{r ex3, tidy=F, dependson='ex1_2'}
 branches <- harvest(analyses, graft, 5)
 chains <- harvest(branches
            , harvest, mcmcsamp, n=100)
 ```

 I'm really impatient and would like to do this in parallel.

 ## parallel

 Just like `plyr` argument `.parallel`.

 * uses [`plyr`](http://cran.r-project.org/package=plyr) and 
 [`foreach`](http://cran.r-project.org/package=foreach) parallel structures.

 ```{r ex3_parallel, warning=F, dependson='ex3.parallel.setup', tidy=F}
 #! Parallel environment already setup.
 system.time(chains <- 
  harvest(branches, harvest, mcmcsamp, n=100, .parallel=TRUE)
 )
 ```

 ## Caching

 Results can be made fault tolerant or interruptible by including caching.

 Caching in `harvestr` indexes on

 * data
 * function
 * seed

 using the `digest` function.

 -----
 ```{r rm_cache, include=F}
 unlink("harvestr-cache", T, T)
 ```

 ```{r caching, dependson='rm_cache'}
 seeds <- gather(100)
 system.time(farm(seeds, tail(rnorm(5e5)), cache=T))
 system.time(farm(seeds, tail(rnorm(5e5)), cache=T))
 head(dir('harvestr-cache'))
 ```

 ## So is it really reproducible?

 ```{r ex5, warning=F}
 set.seed(123)
 seeds <- gather(4)
 a <- farm(seeds, runif(5))
 b <- farm(seeds, runif(5))
 c <- farm(seeds, runif(5), .parallel=T)
 identical(a, b)
 identical(a, c)
 ```


 # Miscellaneous Extras

 ## Building blocks
 Some building blocks that might *might* be helpful.

 * `plant`- for setting up copies of an object with given seeds.
 * `sprout` - for obtaining the sub-streams used with graft.
 * `reap` - single object version of `harvest`


 ## In case you are wondering

 * Yes it works with `Rcpp` code,
    + provided the compiled code uses the RNGScope for RNG in C++.
 * **But** take care to not carry C++ reference objects across parallel calls.


 ------

 \titlepage

 ```{r stop_cl, include=F}
 stopCluster(cl)
 ```
diff --git a/utah.beamer.tex b/utah.beamer.tex
 \documentclass[$if(fontsize)$$fontsize$,$endif$$if(handout)$handout,$endif$$if(beamer)$ignorenonframetext,$endif$]{$documentclass$}
 $if(theme)$
 \usetheme{$theme$}
 $endif$
 $if(colortheme)$
 \usecolortheme{$colortheme$}
 $endif$
 \usepackage{amssymb,amsmath}
 % \usepackage{ifxetex,ifluatex}
 % \ifxetex
  % \usepackage{fontspec,xltxtra,xunicode}
  % \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
 % \else
  % \ifluatex
    % \usepackage{fontspec}
    % \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
  % \else
    % \usepackage[utf8]{inputenc}
  % \fi
 % \fi
 $if(natbib)$
 \usepackage{natbib}
 \bibliographystyle{plainnat}
 $endif$
 $if(biblatex)$
 \usepackage{biblatex}
 $if(biblio-files)$
 \bibliography{$biblio-files$}
 $endif$
 $endif$
 $if(listings)$
 \usepackage{listings}
 $endif$
 $if(lhs)$
 \lstnewenvironment{code}{\lstset{language=Haskell,basicstyle=\small\ttfamily}}{}
 $endif$
 $if(highlighting-macros)$
 $highlighting-macros$
 $endif$
 $if(verbatim-in-note)$
 \usepackage{fancyvrb}
 $endif$
 $if(fancy-enums)$
 % Redefine labelwidth for lists; otherwise, the enumerate package will cause
 % markers to extend beyond the left margin.
 % \makeatletter\AtBeginDocument{%
  % \renewcommand{\@listi}
    % {\setlength{\labelwidth}{4em}}
 % }\makeatother
 \usepackage{enumerate}
 $endif$
 $if(tables)$
 \usepackage{ctable}
 \usepackage{float} % provides the H option for float placement
 $endif$
 $if(url)$
 \usepackage{url}
 $endif$
 $if(graphics)$
 \usepackage{graphicx}
 $endif$
 % Comment these out if you don't want a slide with just the
 % part/section/subsection/subsubsection title:
 \AtBeginPart{\frame{\partpage}}
 \AtBeginSection{\frame{\sectionpage}}
 \AtBeginSubsection{\frame{\subsectionpage}}
 \AtBeginSubsubsection{\frame{\subsubsectionpage}}
 $if(strikeout)$
 \usepackage[normalem]{ulem}
 % avoid problems with \sout in headers with hyperref:
 \pdfstringdefDisableCommands{\renewcommand{\sout}{}}
 $endif$
 $if(subscript)$
 \newcommand{\textsubscr}[1]{\ensuremath{_{\scriptsize\textrm{#1}}}}
 $endif$
 % \setlength{\parindent}{0pt}
 % \setlength{\parskip}{6pt plus 2pt minus 1pt}
 % \setlength{\emergencystretch}{3em}  % prevent overfull lines
 $if(numbersections)$
 $else$
 \setcounter{secnumdepth}{0}
 $endif$
 $if(verbatim-in-note)$
 \VerbatimFootnotes % allows verbatim text in footnotes
 $endif$
 $if(lang)$
 \usepackage[$lang$]{babel}
 $endif$
 $for(header-includes)$
 $header-includes$
 $endfor$

 $if(title)$
 $if(shorttitle)$
 \title[$shorttitle$]{$title$}
 $else$
 \title{$title$}
 $endif$
 $endif$
 $if(author)$
 \author{$for(author)$$author$$sep$ \and $endfor$}
 $endif$
 $if(date)$
 \date{$date$}
 $endif$
 \institute[Utah]{
 University of Utah\\
 School of Medicine
 }
 \setbeamertemplate{navigation symbols}{}
 \begin{document}
 $if(title)$
 \frame{\titlepage}
 $endif$

 $for(include-before)$
 $include-before$

 $endfor$
 $if(toc)$
 \begin{frame}
 \tableofcontents[hideallsubsections]
 \end{frame}

 $endif$
 $body$

 $if(natbib)$
 $if(biblio-files)$
 $if(biblio-title)$
 $if(book-class)$
 \renewcommand\bibname{$biblio-title$}
 $else$
 \renewcommand\refname{$biblio-title$}
 $endif$
 $endif$
 \bibliography{$biblio-files$}

 $endif$
 $endif$
 $if(biblatex)$
 \printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$

 $endif$
 $for(include-after)$
 $include-after$

 $endfor$
 \end{document}
	% Reproducible Parallel Simulations with Harvestr
	% \href{mailto:[email protected]}{Andrew Redd}
	% UseR! 2012

	`ro dev='pdf', cache=T, warning=T, error=T, out.width="\\textwidth"
	, fig.width=4, fig.height=3 or`
	```{r ex1_0, include=F}
	library(harvestr)
	library(plyr)
	library(ggplot2)
	library(lme4)
	library(dostats)
	library(doParallel)
	options(harvestr.time=F)
	```
	```{r ex3_parallel_setup, include=F, dependson='ex3', cache=F}
	library(parallel)
	library(doParallel)
	cl <- makeCluster(2)
	clusterEvalQ(cl, {
	library(lme4)
	library(harvestr)
	})
	clusterExport(cl, ls())
	registerDoParallel(cl)
	```

	\includegraphics[width=\textwidth]{./Bill Gates lazy person.jpg)



	# Reproducible Simulation

	## Reproducibility

	* Same script
	* Same results
	* Anywhere
	+ Single thread
	+ Multi-core
	+ Cloud Scale

	## Everything starts with a seed.
	Simulation is based off Pseudo-random number generation (PRNG).

	* PRNG is sequential, next number depends on the last state.
	* Seeds are used to store the state of a random number generator
	* by 'Setting a seed' one can place a PRNG into any exact state.

	## Parallel Random Number Generation
	Simulation is complicated in new parallel environments.

	* PRNG is sequential,
	* parallel execution is not,
	* and order of execution is not guaranteed.

	This is where parallel pseudo-random number generators help out.

	## Parallel PRNG
	Parallel pseudo-random number generators start with a singe state that
	can spawn additional streams as well as streams of random numbers.

	1. SPRNG
	2. L'Ecuyer combined multiple-recursive generator

	# Introducing `harvestr`

	## R package `harvestr`
	<https://github.com/halpo/harvestr>

	What `harvestr` does:

	* Reproducibility
	* Caching
	* Under parallelized environments.



	## How `harvestr` works

	* Analytical elements are separated into work-flows of dependent elements.
	+ Set up environment/seed
	+ Generate Data
	+ Perform analysis
	- Stochastic
	- Non-Stochastic
	+ Summarize
	* Results from one step carry to another by carrying the seed with the results.


	## Primary work-flow for `harvestr`

	* `gather(n)` - generate `n` random number streams.
	* `farm(seeds, expr)` - evaluate `expr` with each seed in `seeds`.
	* `harvest(x, fun)` - for each data in `x` call the function `fun`
	(based off `plyr`s `llply`).

	\includegraphics[width=\textwidth]{./c0bfeebb.pdf}

	## Example - Simple simulation

	Generate Data

	```{r ex1_1}
	seeds <- gather(10, seed=20120614)
	data <- farm(seeds, {
	x1 <- rnorm(400)
	x2 <- rnorm(400)
	g <- rep(rnorm(4), each=100)
	trt <- rep(1:4, each=100)
	y <- rnorm(n=400, mean=3*x1 + x2 + g)
	data.frame(y, x1, x2, trt)
	})
	```

	-------

	Perform Analysis

	```{r ex1_2, dependson='ex1_1'}
	analyses <- harvest(data, lmer, formula = y~x1+(1\|trt))
	analyses[[1]]
	```

	-------

	Recombine

	```{r ex1_3, dependson='ex1_2'}
	results <- ldply(analyses, fixef)
	qplot(data=results[,1:2], x=x1, y=`(Intercept)`)
	```

	## Stochastic Analysis in `harvestr`

	* `gather` then `farm` as before.
	* `graft` to generate seeds

	\includegraphics[width=\textwidth]{./b837b118.pdf}

	## Example 2 - Stochastic Analysis
	graft to obtain independent RNG sub-streams

	```{r ex2_1, dependson='ex1_2'}
	branch <- graft(analyses[[1]], 5)
	chains <- harvest(branch, mcmcsamp, 100)
	mcmcfx <- harvest(chains, slot, 'fixef')
	names(mcmcfx) <- 1:5
	df2 <- ldply(llply(mcmcfx, t), as.data.frame)
	```

	------

	```{r ex2_2, dependson='ex2_1'}
	qplot(data=df2, x1, col=.id, geom=c('density'))
	```

	## Example 3 Chained.

	```{r ex3, tidy=F, dependson='ex1_2'}
	branches <- harvest(analyses, graft, 5)
	chains <- harvest(branches
	, harvest, mcmcsamp, n=100)
	```

	I'm really impatient and would like to do this in parallel.

	## parallel

	Just like `plyr` argument `.parallel`.

	* uses [`plyr`](http://cran.r-project.org/package=plyr) and
	[`foreach`](http://cran.r-project.org/package=foreach) parallel structures.

	```{r ex3_parallel, warning=F, dependson='ex3.parallel.setup', tidy=F}
	#! Parallel environment already setup.
	system.time(chains <-
	harvest(branches, harvest, mcmcsamp, n=100, .parallel=TRUE)
	)
	```

	## Caching

	Results can be made fault tolerant or interruptible by including caching.

	Caching in `harvestr` indexes on

	* data
	* function
	* seed

	using the `digest` function.

	-----
	```{r rm_cache, include=F}
	unlink("harvestr-cache", T, T)
	```

	```{r caching, dependson='rm_cache'}
	seeds <- gather(100)
	system.time(farm(seeds, tail(rnorm(5e5)), cache=T))
	system.time(farm(seeds, tail(rnorm(5e5)), cache=T))
	head(dir('harvestr-cache'))
	```

	## So is it really reproducible?

	```{r ex5, warning=F}
	set.seed(123)
	seeds <- gather(4)
	a <- farm(seeds, runif(5))
	b <- farm(seeds, runif(5))
	c <- farm(seeds, runif(5), .parallel=T)
	identical(a, b)
	identical(a, c)
	```


	# Miscellaneous Extras

	## Building blocks
	Some building blocks that might might be helpful.

	* `plant`- for setting up copies of an object with given seeds.
	* `sprout` - for obtaining the sub-streams used with graft.
	* `reap` - single object version of `harvest`


	## In case you are wondering

	* Yes it works with `Rcpp` code,
	+ provided the compiled code uses the RNGScope for RNG in C++.
	* But take care to not carry C++ reference objects across parallel calls.


	------

	\titlepage

	```{r stop_cl, include=F}
	stopCluster(cl)
	```
	\documentclass[$if(fontsize)$$fontsize$,$endif$$if(handout)$handout,$endif$$if(beamer)$ignorenonframetext,$endif$]{$documentclass$}
	$if(theme)$
	\usetheme{$theme$}
	$endif$
	$if(colortheme)$
	\usecolortheme{$colortheme$}
	$endif$
	\usepackage{amssymb,amsmath}
	% \usepackage{ifxetex,ifluatex}
	% \ifxetex
	% \usepackage{fontspec,xltxtra,xunicode}
	% \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
	% \else
	% \ifluatex
	% \usepackage{fontspec}
	% \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
	% \else
	% \usepackage[utf8]{inputenc}
	% \fi
	% \fi
	$if(natbib)$
	\usepackage{natbib}
	\bibliographystyle{plainnat}
	$endif$
	$if(biblatex)$
	\usepackage{biblatex}
	$if(biblio-files)$
	\bibliography{$biblio-files$}
	$endif$
	$endif$
	$if(listings)$
	\usepackage{listings}
	$endif$
	$if(lhs)$
	\lstnewenvironment{code}{\lstset{language=Haskell,basicstyle=\small\ttfamily}}{}
	$endif$
	$if(highlighting-macros)$
	$highlighting-macros$
	$endif$
	$if(verbatim-in-note)$
	\usepackage{fancyvrb}
	$endif$
	$if(fancy-enums)$
	% Redefine labelwidth for lists; otherwise, the enumerate package will cause
	% markers to extend beyond the left margin.
	% \makeatletter\AtBeginDocument{%
	% \renewcommand{\@listi}
	% {\setlength{\labelwidth}{4em}}
	% }\makeatother
	\usepackage{enumerate}
	$endif$
	$if(tables)$
	\usepackage{ctable}
	\usepackage{float} % provides the H option for float placement
	$endif$
	$if(url)$
	\usepackage{url}
	$endif$
	$if(graphics)$
	\usepackage{graphicx}
	$endif$
	% Comment these out if you don't want a slide with just the
	% part/section/subsection/subsubsection title:
	\AtBeginPart{\frame{\partpage}}
	\AtBeginSection{\frame{\sectionpage}}
	\AtBeginSubsection{\frame{\subsectionpage}}
	\AtBeginSubsubsection{\frame{\subsubsectionpage}}
	$if(strikeout)$
	\usepackage[normalem]{ulem}
	% avoid problems with \sout in headers with hyperref:
	\pdfstringdefDisableCommands{\renewcommand{\sout}{}}
	$endif$
	$if(subscript)$
	\newcommand{\textsubscr}[1]{\ensuremath{_{\scriptsize\textrm{#1}}}}
	$endif$
	% \setlength{\parindent}{0pt}
	% \setlength{\parskip}{6pt plus 2pt minus 1pt}
	% \setlength{\emergencystretch}{3em} % prevent overfull lines
	$if(numbersections)$
	$else$
	\setcounter{secnumdepth}{0}
	$endif$
	$if(verbatim-in-note)$
	\VerbatimFootnotes % allows verbatim text in footnotes
	$endif$
	$if(lang)$
	\usepackage[$lang$]{babel}
	$endif$
	$for(header-includes)$
	$header-includes$
	$endfor$

	$if(title)$
	$if(shorttitle)$
	\title[$shorttitle$]{$title$}
	$else$
	\title{$title$}
	$endif$
	$endif$
	$if(author)$
	\author{$for(author)$$author$$sep$ \and $endfor$}
	$endif$
	$if(date)$
	\date{$date$}
	$endif$
	\institute[Utah]{
	University of Utah\\
	School of Medicine
	}
	\setbeamertemplate{navigation symbols}{}
	\begin{document}
	$if(title)$
	\frame{\titlepage}
	$endif$

	$for(include-before)$
	$include-before$

	$endfor$
	$if(toc)$
	\begin{frame}
	\tableofcontents[hideallsubsections]
	\end{frame}

	$endif$
	$body$

	$if(natbib)$
	$if(biblio-files)$
	$if(biblio-title)$
	$if(book-class)$
	\renewcommand\bibname{$biblio-title$}
	$else$
	\renewcommand\refname{$biblio-title$}
	$endif$
	$endif$
	\bibliography{$biblio-files$}

	$endif$
	$endif$
	$if(biblatex)$
	\printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$

	$endif$
	$for(include-after)$
	$include-after$

	$endfor$
	\end{document}