Skip to content

Instantly share code, notes, and snippets.

@raffdoc
Forked from halpo/000-instructions.md
Created June 19, 2012 20:12
Show Gist options
  • Save raffdoc/2956267 to your computer and use it in GitHub Desktop.
Save raffdoc/2956267 to your computer and use it in GitHub Desktop.
harvestr R users conference presentation.

Building a beamer presentation with knitr.

Introduction

The documents included are the input for knitr. In addition you need to have the tool pandoc installed. I also use a custom beamer template to add the University of Utah \institute command to the template. It also changes the indentation some.

Steps

  1. knit document with
library(knitr)
knitr("harvestr.Rmd")

This step will require harvestr and it's dependencies to be installed in R.

  1. use pandoc to create beamer slides.
pandoc -t beamer --template=utah.beamer.tex -Vurl:1 -V theme:CambridgeUS -V colortheme:beaver  -V shorttitle:harvestr  harvestr.md -o harvestr.pdf

The final result

% Reproducible Parallel Simulations with Harvestr
% \href{mailto:[email protected]}{Andrew Redd}
% UseR! 2012
`ro dev='pdf', cache=T, warning=T, error=T, out.width="\\textwidth"
, fig.width=4, fig.height=3 or`
```{r ex1_0, include=F}
library(harvestr)
library(plyr)
library(ggplot2)
library(lme4)
library(dostats)
library(doParallel)
options(harvestr.time=F)
```
```{r ex3_parallel_setup, include=F, dependson='ex3', cache=F}
library(parallel)
library(doParallel)
cl <- makeCluster(2)
clusterEvalQ(cl, {
library(lme4)
library(harvestr)
})
clusterExport(cl, ls())
registerDoParallel(cl)
```
\includegraphics[width=\textwidth]{./Bill Gates lazy person.jpg)
# Reproducible Simulation
## Reproducibility
* Same script
* Same results
* Anywhere
+ Single thread
+ Multi-core
+ Cloud Scale
## Everything starts with a seed.
Simulation is based off Pseudo-random number generation (PRNG).
* PRNG is sequential, next number depends on the last state.
* Seeds are used to store the state of a random number generator
* by 'Setting a seed' one can place a PRNG into any exact state.
## Parallel Random Number Generation
Simulation is complicated in new parallel environments.
* PRNG is sequential,
* parallel execution is not,
* and order of execution is not guaranteed.
This is where parallel pseudo-random number generators help out.
## Parallel PRNG
Parallel pseudo-random number generators start with a singe state that
can spawn additional streams as well as streams of random numbers.
1. SPRNG
2. L'Ecuyer combined multiple-recursive generator
# Introducing `harvestr`
## R package `harvestr`
<https://github.com/halpo/harvestr>
What `harvestr` does:
* Reproducibility
* Caching
* Under parallelized environments.
## How `harvestr` works
* Analytical elements are separated into work-flows of dependent elements.
+ Set up environment/seed
+ Generate Data
+ Perform analysis
- Stochastic
- Non-Stochastic
+ Summarize
* Results from one step carry to another by carrying the seed with the results.
## **Primary work-flow** for `harvestr`
* `gather(n)` - generate `n` random number streams.
* `farm(seeds, expr)` - evaluate `expr` with each seed in `seeds`.
* `harvest(x, fun)` - for each data in `x` call the function `fun`
(based off `plyr`s `llply`).
\includegraphics[width=\textwidth]{./c0bfeebb.pdf}
## Example - Simple simulation
**Generate Data**
```{r ex1_1}
seeds <- gather(10, seed=20120614)
data <- farm(seeds, {
x1 <- rnorm(400)
x2 <- rnorm(400)
g <- rep(rnorm(4), each=100)
trt <- rep(1:4, each=100)
y <- rnorm(n=400, mean=3*x1 + x2 + g)
data.frame(y, x1, x2, trt)
})
```
-------
**Perform Analysis**
```{r ex1_2, dependson='ex1_1'}
analyses <- harvest(data, lmer, formula = y~x1+(1|trt))
analyses[[1]]
```
-------
**Recombine**
```{r ex1_3, dependson='ex1_2'}
results <- ldply(analyses, fixef)
qplot(data=results[,1:2], x=x1, y=`(Intercept)`)
```
## Stochastic Analysis in `harvestr`
* `gather` then `farm` as before.
* `graft` to generate seeds
\includegraphics[width=\textwidth]{./b837b118.pdf}
## Example 2 - Stochastic Analysis
**graft to obtain independent RNG sub-streams**
```{r ex2_1, dependson='ex1_2'}
branch <- graft(analyses[[1]], 5)
chains <- harvest(branch, mcmcsamp, 100)
mcmcfx <- harvest(chains, slot, 'fixef')
names(mcmcfx) <- 1:5
df2 <- ldply(llply(mcmcfx, t), as.data.frame)
```
------
```{r ex2_2, dependson='ex2_1'}
qplot(data=df2, x1, col=.id, geom=c('density'))
```
## Example 3 Chained.
```{r ex3, tidy=F, dependson='ex1_2'}
branches <- harvest(analyses, graft, 5)
chains <- harvest(branches
, harvest, mcmcsamp, n=100)
```
I'm really impatient and would like to do this in parallel.
## parallel
Just like `plyr` argument `.parallel`.
* uses [`plyr`](http://cran.r-project.org/package=plyr) and
[`foreach`](http://cran.r-project.org/package=foreach) parallel structures.
```{r ex3_parallel, warning=F, dependson='ex3.parallel.setup', tidy=F}
#! Parallel environment already setup.
system.time(chains <-
harvest(branches, harvest, mcmcsamp, n=100, .parallel=TRUE)
)
```
## Caching
Results can be made fault tolerant or interruptible by including caching.
Caching in `harvestr` indexes on
* data
* function
* seed
using the `digest` function.
-----
```{r rm_cache, include=F}
unlink("harvestr-cache", T, T)
```
```{r caching, dependson='rm_cache'}
seeds <- gather(100)
system.time(farm(seeds, tail(rnorm(5e5)), cache=T))
system.time(farm(seeds, tail(rnorm(5e5)), cache=T))
head(dir('harvestr-cache'))
```
## So is it really reproducible?
```{r ex5, warning=F}
set.seed(123)
seeds <- gather(4)
a <- farm(seeds, runif(5))
b <- farm(seeds, runif(5))
c <- farm(seeds, runif(5), .parallel=T)
identical(a, b)
identical(a, c)
```
# Miscellaneous Extras
## Building blocks
Some building blocks that might *might* be helpful.
* `plant`- for setting up copies of an object with given seeds.
* `sprout` - for obtaining the sub-streams used with graft.
* `reap` - single object version of `harvest`
## In case you are wondering
* Yes it works with `Rcpp` code,
+ provided the compiled code uses the RNGScope for RNG in C++.
* **But** take care to not carry C++ reference objects across parallel calls.
------
\titlepage
```{r stop_cl, include=F}
stopCluster(cl)
```
\documentclass[$if(fontsize)$$fontsize$,$endif$$if(handout)$handout,$endif$$if(beamer)$ignorenonframetext,$endif$]{$documentclass$}
$if(theme)$
\usetheme{$theme$}
$endif$
$if(colortheme)$
\usecolortheme{$colortheme$}
$endif$
\usepackage{amssymb,amsmath}
% \usepackage{ifxetex,ifluatex}
% \ifxetex
% \usepackage{fontspec,xltxtra,xunicode}
% \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
% \else
% \ifluatex
% \usepackage{fontspec}
% \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
% \else
% \usepackage[utf8]{inputenc}
% \fi
% \fi
$if(natbib)$
\usepackage{natbib}
\bibliographystyle{plainnat}
$endif$
$if(biblatex)$
\usepackage{biblatex}
$if(biblio-files)$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(listings)$
\usepackage{listings}
$endif$
$if(lhs)$
\lstnewenvironment{code}{\lstset{language=Haskell,basicstyle=\small\ttfamily}}{}
$endif$
$if(highlighting-macros)$
$highlighting-macros$
$endif$
$if(verbatim-in-note)$
\usepackage{fancyvrb}
$endif$
$if(fancy-enums)$
% Redefine labelwidth for lists; otherwise, the enumerate package will cause
% markers to extend beyond the left margin.
% \makeatletter\AtBeginDocument{%
% \renewcommand{\@listi}
% {\setlength{\labelwidth}{4em}}
% }\makeatother
\usepackage{enumerate}
$endif$
$if(tables)$
\usepackage{ctable}
\usepackage{float} % provides the H option for float placement
$endif$
$if(url)$
\usepackage{url}
$endif$
$if(graphics)$
\usepackage{graphicx}
$endif$
% Comment these out if you don't want a slide with just the
% part/section/subsection/subsubsection title:
\AtBeginPart{\frame{\partpage}}
\AtBeginSection{\frame{\sectionpage}}
\AtBeginSubsection{\frame{\subsectionpage}}
\AtBeginSubsubsection{\frame{\subsubsectionpage}}
$if(strikeout)$
\usepackage[normalem]{ulem}
% avoid problems with \sout in headers with hyperref:
\pdfstringdefDisableCommands{\renewcommand{\sout}{}}
$endif$
$if(subscript)$
\newcommand{\textsubscr}[1]{\ensuremath{_{\scriptsize\textrm{#1}}}}
$endif$
% \setlength{\parindent}{0pt}
% \setlength{\parskip}{6pt plus 2pt minus 1pt}
% \setlength{\emergencystretch}{3em} % prevent overfull lines
$if(numbersections)$
$else$
\setcounter{secnumdepth}{0}
$endif$
$if(verbatim-in-note)$
\VerbatimFootnotes % allows verbatim text in footnotes
$endif$
$if(lang)$
\usepackage[$lang$]{babel}
$endif$
$for(header-includes)$
$header-includes$
$endfor$
$if(title)$
$if(shorttitle)$
\title[$shorttitle$]{$title$}
$else$
\title{$title$}
$endif$
$endif$
$if(author)$
\author{$for(author)$$author$$sep$ \and $endfor$}
$endif$
$if(date)$
\date{$date$}
$endif$
\institute[Utah]{
University of Utah\\
School of Medicine
}
\setbeamertemplate{navigation symbols}{}
\begin{document}
$if(title)$
\frame{\titlepage}
$endif$
$for(include-before)$
$include-before$
$endfor$
$if(toc)$
\begin{frame}
\tableofcontents[hideallsubsections]
\end{frame}
$endif$
$body$
$if(natbib)$
$if(biblio-files)$
$if(biblio-title)$
$if(book-class)$
\renewcommand\bibname{$biblio-title$}
$else$
\renewcommand\refname{$biblio-title$}
$endif$
$endif$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(biblatex)$
\printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$
$endif$
$for(include-after)$
$include-after$
$endfor$
\end{document}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment