jepusto · September 28, 2016 18:16
diff --git a/simulation-in-R-2016 b/simulation-in-R-2016
 ---
 title: "Designing simulation studies in R"
 author: "James E. Pustejovsky"
 date: "September 28, 2016"
 output:
  ioslides_presentation:
    css: custom.css
    widescreen: true
    transition: faster
 ---

 ```{r setup, include=FALSE}
 knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.width = 8, fig.height = 5)
 ```

 ## Simulation

 >1. Generate data from a __probability model__ where you know the true parameters.
 >2. Apply one or more estimation procedures to the data, record the results.
 >3. Repeat many times ("in repeated samples...").
 >4. Compare the results to the truth (bias, accuracy, error/coverage rates).

 ## Why is simulation useful?

 >* Building understanding & intuition about statistical models
 >* Assessing/checking modeling strategies
 >* Formal research

 ## Modeling batting averages

 h/t David Robinson for this example (more details [on his blog](http://varianceexplained.org/r/beta_binomial_baseball/))

 <div class="columns-2">
 <img src="babe-ruth.jpg" height="400px" />

 >* Baseball batting average = # hits / # at-bats
 >* Correlated with # at-bats because better hitters get more opportunities
 >* Beta-binomial regression is a useful way to model this relationship 

 </div>


 ## Batting averages vs. at-bats

 ```{r}
 library(dplyr)
 library(tidyr)
 library(Lahman)
 library(ggplot2)

 pitchers <- Pitching %>%
  group_by(playerID) %>%
  summarize(gamesPitched = sum(G)) %>%
  filter(gamesPitched > 3)

 career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(pitchers, by = "playerID") %>%
  group_by(playerID) %>%
  summarize_each(funs(sum), H, AB) %>%
  mutate(average = H / AB)

 career <- Master %>%
  tbl_df() %>%
  select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID")

 career %>%
  filter(AB >= 20) %>%
  ggplot(aes(AB, average)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  scale_x_log10() + theme_minimal() + 
  labs(x = "Career at-bats", y = "Career batting-average")

 ```

 ```{r, eval = FALSE, include = FALSE}
 BA_betabin <- gamlss(cbind(H, AB - H) ~ log(AB),
                 data = career, family = BB(mu.link = "identity", sigma.link = "inverse"))
 summary(BA_betabin)

 ```

 Source: [Lahman's Baseball Database, 1871-2015](http://www.seanlahman.com/baseball-archive/statistics/) (R package `Lahman`)

 ## Beta-binomial regression

 >* $Y_i$ is number of hits by player $i$
 >* $n_i$ is number of at-bats
 >* $Y_i \sim Binomial(n_i, \pi_i)$, where $p_i$ is true batting ability of player $i$
 >* $\pi_i \sim Beta\left(\mu_{n_i} / \sigma, (1 - \mu_{n_i}) / \sigma\right)$, where $\mu_{n_i}$ is average batting ability of players with $n_i$ at-bats and $\sigma$ describes the variability in true ability.
 >* $\mu_{n_i} = \beta_0 + \beta_1 \log(n_i)$

 ## Simulate to build understanding 

 ```{r}
 career_at_bats <- career$AB
 ```

 ```{r, echo = TRUE}
 set.seed(20160928)
 players <- 1000
 at_bats <- sample(career_at_bats, size = players)
 summary(at_bats)
 mu <- 0.140 + 0.015 * log(at_bats)
 summary(mu)
 ```

 ## Simulate true abilities

 ```{r, echo = TRUE}
 sigma <- 1 / 500
 ability <- rbeta(players, 
                 shape1 = mu / sigma, 
                 shape2 = (1 - mu) / sigma)
 ```

 ## Simulate true abilities

 ```{r}
 dat <- data_frame(at_bats, mu, ability)
 ggplot(dat, aes(at_bats, ability)) + 
  geom_point(alpha = 0.3) +
  geom_smooth() +
  scale_x_log10() + theme_minimal() + 
  ylim(0, 0.6) + 
  labs(x = "Career at-bats", y = "Batting ability")
 ```

 ## Simulate batting averages

 ```{r, echo = TRUE}
 dat$hits <- with(dat, rbinom(n = players, size = at_bats, prob = ability))
 dat$batting_avg <- with(dat, hits / at_bats)
 ```

 ## Simulate batting averages

 ```{r}
 ggplot(dat, aes(at_bats, batting_avg)) + 
  geom_point(alpha = 0.3, color = "purple") +
  geom_smooth() +
  scale_x_log10() + theme_minimal() + 
  ylim(0, 0.6) + 
  labs(x = "Career at-bats", y = "Observed batting average")
 ```

 ## Fit the beta-binomial regression

 ```{r, echo = TRUE, results = "hide"}
 library(gamlss)
 bb_fit <- gamlss(cbind(hits, at_bats - hits) ~ log(at_bats),
                 data = dat, family = BB(mu.link = "identity"))
 coef(bb_fit)
 ```
 ```{r}
 coef(bb_fit)
 ```

 ## Simulate to check modeling strategies

 >* In real-world data analysis, there are __almost always__ multiple ways to approach a problem.
 >* Small simulations are a useful way to test out strategies for use in a given setting.
 >* For modeling batting averages, beta-binomial regression is useful but __SLOW__.
 >* Would it work to use a __quasi-binomial glm__ instead?

 ## Data-generating function

 ```{r, echo = TRUE}
 simulate_batting_avgs <- function(players, beta, sigma) {
  at_bats <- sample(career_at_bats, size = players)
  mu <- beta[1] + beta[2] * log(at_bats)
  ability <- rbeta(players, 
                   shape1 = mu / sigma, 
                   shape2 = (1 - mu) / sigma)
  hits <- rbinom(n = players, size = at_bats, prob = ability)
  data_frame(at_bats, hits)
 }

 new_dat <- simulate_batting_avgs(players = 400, 
                                 beta = c(0.140, 0.015), 
                                 sigma = 1 / 500)
 ```

 ## Data-generating function

 ```{r, echo = TRUE}
 new_dat
 ```

 ## Modeling function

 ```{r, echo = TRUE}
 quasibinomial_CI <- function(dat, level = 0.95) {
  glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats), data = dat,
                 family = quasibinomial(link = "identity"))
  b <- coef(glm_fit)
  se <- sqrt(diag(vcov(glm_fit)))
  crit <- qnorm(1 - (1 - level) / 2)
  data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
 }

 quasibinomial_CI(new_dat)
 ```

 ## Put them together!

 ```{r, echo = TRUE}
 lots_of_CIs <- 
  replicate(2000, {
    dat <- simulate_batting_avgs(players = 400, beta = c(0.140, 0.015), sigma = 1 / 500)
    quasibinomial_CI(dat)
  }, simplify = FALSE)

 ```

 ## Confidence interval coverage 

 ```{r}
 CI_dat <- 
  bind_rows(lots_of_CIs) %>%
  group_by(term) %>%
  mutate(n = row_number())

 true_dat <- data_frame(term = c("(Intercept)","log(at_bats)"), val = c(0.140, 0.015))
 CI_dat %>% filter(n < 200) %>%
 ggplot() + 
  geom_segment(aes(x = n, xend = n, y = L, yend = U)) + 
  geom_hline(dat = true_dat, aes(yintercept = val), col = "blue") + 
  facet_wrap(~ term, ncol = 1, scales = "free") + 
  theme_minimal() + 
  labs(x = "Iteration", y = "Confidence interval")
 ```

 ## Confidence interval coverage 

 ```{r, echo = TRUE}
 bind_rows(lots_of_CIs) %>%
  left_join(true_dat, by = "term") %>%
  mutate(covered = L < val & val < U) %>%
  group_by(term) %>%
  summarise(coverage = mean(covered))
 ```

 ```{r, eval = FALSE, include = FALSE}
 library(sandwich)
 robust_quasibinomial_CI <- function(dat, level = 0.95) {
  glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats), data = dat,
                 family = quasibinomial(link = "identity"))
  b <- coef(glm_fit)
  se <- sqrt(diag(vcovHC(glm_fit)))
  crit <- qnorm(1 - (1 - level) / 2)
  data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
 }

 lots_of_robust_CIs <- 
  replicate(2000, {
    dat <- simulate_batting_avgs(players = 400, 
                                 beta = c(0.140, 0.015), 
                                 sigma = 1 / 500)
    robust_quasibinomial_CI(dat)
  }, simplify = FALSE)

 bind_rows(lots_of_robust_CIs) %>%
  left_join(true_dat, by = "term") %>%
  mutate(covered = L < val & val < U) %>%
  group_by(term) %>%
  summarise(coverage = mean(covered))

 ```

 ## Simulation studies in formal research

 Questions about quantitative methodology:

 > * Which type of confidence intervals should be used for the beta-binomial model?
 > * Which of these twelve tests should I use for one-way ANOVA when the variances are non-homogeneous?
 > * How big a sample is needed to get accurate estimates of variance components in a multi-level logistic regression model?
 > * Is it reasonable to use a multivariate normal model to impute missing data, even though the variables look skewed?

 ## Why focus on simulation studies? 

 Few alternatives for assessing 

 >* small-sample performance of estimation methods
 >* performance of combinations of methods (data analysis pipelines)
 >* robustness under model mis-specification
 >* comparison of competing methods

 ## Simulation design

 ```{r, fig.width = 10, fig.height = 6, out.width = "800px"}
 library(diagram)
 par(mar = c(0.1, 0.1, 0.1, 0.1))
 openplotmat()
 elpos <- coordinates(c(2,1,2))
 fromto <- matrix(ncol = 2, byrow = TRUE,
                 data = c(4, 1, 1, 2, 2, 3, 3, 5))
 nr <- nrow(fromto)
 arrpos <- matrix(ncol = 2, nrow = nr)
 for (i in 1:nr) {
 arrpos[i, ] <- straightarrow(from = elpos[fromto[i, 1], ],
                               to = elpos[fromto[i, 2], ],
                               lwd = 2, arr.pos = 0.6, arr.length = 0.5) 
 }
 
 box_dat <- data_frame(lab = c("Data-generating model",
                              "Estimation methods",
                              "Performance criteria",
                              "Experimental design",
                              "Results"),
                      col = c("lightblue","lightgreen","yellow","orange","red"))

 for (i in 1:nrow(box_dat)) {
  textrect(elpos[i,], 0.2, 0.1, lab = box_dat$lab[i],
           box.col = box_dat$col[i], shadow.size = 0, cex = 2)
 }

 ```

 ## Simulation design strategy

 * Write separate functions for each component of the simulation
    * Makes it easier to debug, modify, or re-use code
 * Test each component
 * Run in parallel where possible

 ## Learning more

 * Spring, 2017: Data Analysis, Simulation, & Programming in R
 * My blog: http://jepusto.github.io/
    * code for today's examples
    * lots of other examples
    * [another lecture on designing simulations](http://jepusto.github.io/Designing-simulation-studies-using-R) 
 * Ask faculty for articles with good simulation studies
 * November 9th QM colloquium: Anita Israni on "Running Simulations on the Texas Advanced Computing Cluster"
	---
	title: "Designing simulation studies in R"
	author: "James E. Pustejovsky"
	date: "September 28, 2016"
	output:
	ioslides_presentation:
	css: custom.css
	widescreen: true
	transition: faster
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.width = 8, fig.height = 5)
	```

	## Simulation

	>1. Generate data from a __probability model__ where you know the true parameters.
	>2. Apply one or more estimation procedures to the data, record the results.
	>3. Repeat many times ("in repeated samples...").
	>4. Compare the results to the truth (bias, accuracy, error/coverage rates).

	## Why is simulation useful?

	>* Building understanding & intuition about statistical models
	>* Assessing/checking modeling strategies
	>* Formal research

	## Modeling batting averages

	h/t David Robinson for this example (more details [on his blog](http://varianceexplained.org/r/beta_binomial_baseball/))

	<div class="columns-2">
	<img src="babe-ruth.jpg" height="400px" />

	>* Baseball batting average = # hits / # at-bats
	>* Correlated with # at-bats because better hitters get more opportunities
	>* Beta-binomial regression is a useful way to model this relationship

	</div>


	## Batting averages vs. at-bats

	```{r}
	library(dplyr)
	library(tidyr)
	library(Lahman)
	library(ggplot2)

	pitchers <- Pitching %>%
	group_by(playerID) %>%
	summarize(gamesPitched = sum(G)) %>%
	filter(gamesPitched > 3)

	career <- Batting %>%
	filter(AB > 0) %>%
	anti_join(pitchers, by = "playerID") %>%
	group_by(playerID) %>%
	summarize_each(funs(sum), H, AB) %>%
	mutate(average = H / AB)

	career <- Master %>%
	tbl_df() %>%
	select(playerID, nameFirst, nameLast) %>%
	unite(name, nameFirst, nameLast, sep = " ") %>%
	inner_join(career, by = "playerID")

	career %>%
	filter(AB >= 20) %>%
	ggplot(aes(AB, average)) +
	geom_point(alpha = 0.3) +
	geom_smooth() +
	scale_x_log10() + theme_minimal() +
	labs(x = "Career at-bats", y = "Career batting-average")

	```

	```{r, eval = FALSE, include = FALSE}
	BA_betabin <- gamlss(cbind(H, AB - H) ~ log(AB),
	data = career, family = BB(mu.link = "identity", sigma.link = "inverse"))
	summary(BA_betabin)

	```

	Source: [Lahman's Baseball Database, 1871-2015](http://www.seanlahman.com/baseball-archive/statistics/) (R package `Lahman`)

	## Beta-binomial regression

	>* $Y_i$ is number of hits by player $i$
	>* $n_i$ is number of at-bats
	>* $Y_i \sim Binomial(n_i, \pi_i)$, where $p_i$ is true batting ability of player $i$
	>* $\pi_i \sim Beta\left(\mu_{n_i} / \sigma, (1 - \mu_{n_i}) / \sigma\right)$, where $\mu_{n_i}$ is average batting ability of players with $n_i$ at-bats and $\sigma$ describes the variability in true ability.
	>* $\mu_{n_i} = \beta_0 + \beta_1 \log(n_i)$

	## Simulate to build understanding

	```{r}
	career_at_bats <- career$AB
	```

	```{r, echo = TRUE}
	set.seed(20160928)
	players <- 1000
	at_bats <- sample(career_at_bats, size = players)
	summary(at_bats)
	mu <- 0.140 + 0.015 * log(at_bats)
	summary(mu)
	```

	## Simulate true abilities

	```{r, echo = TRUE}
	sigma <- 1 / 500
	ability <- rbeta(players,
	shape1 = mu / sigma,
	shape2 = (1 - mu) / sigma)
	```

	## Simulate true abilities

	```{r}
	dat <- data_frame(at_bats, mu, ability)
	ggplot(dat, aes(at_bats, ability)) +
	geom_point(alpha = 0.3) +
	geom_smooth() +
	scale_x_log10() + theme_minimal() +
	ylim(0, 0.6) +
	labs(x = "Career at-bats", y = "Batting ability")
	```

	## Simulate batting averages

	```{r, echo = TRUE}
	dat$hits <- with(dat, rbinom(n = players, size = at_bats, prob = ability))
	dat$batting_avg <- with(dat, hits / at_bats)
	```

	## Simulate batting averages

	```{r}
	ggplot(dat, aes(at_bats, batting_avg)) +
	geom_point(alpha = 0.3, color = "purple") +
	geom_smooth() +
	scale_x_log10() + theme_minimal() +
	ylim(0, 0.6) +
	labs(x = "Career at-bats", y = "Observed batting average")
	```

	## Fit the beta-binomial regression

	```{r, echo = TRUE, results = "hide"}
	library(gamlss)
	bb_fit <- gamlss(cbind(hits, at_bats - hits) ~ log(at_bats),
	data = dat, family = BB(mu.link = "identity"))
	coef(bb_fit)
	```
	```{r}
	coef(bb_fit)
	```

	## Simulate to check modeling strategies

	>* In real-world data analysis, there are __almost always__ multiple ways to approach a problem.
	>* Small simulations are a useful way to test out strategies for use in a given setting.
	>* For modeling batting averages, beta-binomial regression is useful but __SLOW__.
	>* Would it work to use a __quasi-binomial glm__ instead?

	## Data-generating function

	```{r, echo = TRUE}
	simulate_batting_avgs <- function(players, beta, sigma) {
	at_bats <- sample(career_at_bats, size = players)
	mu <- beta[1] + beta[2] * log(at_bats)
	ability <- rbeta(players,
	shape1 = mu / sigma,
	shape2 = (1 - mu) / sigma)
	hits <- rbinom(n = players, size = at_bats, prob = ability)
	data_frame(at_bats, hits)
	}

	new_dat <- simulate_batting_avgs(players = 400,
	beta = c(0.140, 0.015),
	sigma = 1 / 500)
	```

	## Data-generating function

	```{r, echo = TRUE}
	new_dat
	```

	## Modeling function

	```{r, echo = TRUE}
	quasibinomial_CI <- function(dat, level = 0.95) {
	glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats), data = dat,
	family = quasibinomial(link = "identity"))
	b <- coef(glm_fit)
	se <- sqrt(diag(vcov(glm_fit)))
	crit <- qnorm(1 - (1 - level) / 2)
	data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
	}

	quasibinomial_CI(new_dat)
	```

	## Put them together!

	```{r, echo = TRUE}
	lots_of_CIs <-
	replicate(2000, {
	dat <- simulate_batting_avgs(players = 400, beta = c(0.140, 0.015), sigma = 1 / 500)
	quasibinomial_CI(dat)
	}, simplify = FALSE)

	```

	## Confidence interval coverage

	```{r}
	CI_dat <-
	bind_rows(lots_of_CIs) %>%
	group_by(term) %>%
	mutate(n = row_number())

	true_dat <- data_frame(term = c("(Intercept)","log(at_bats)"), val = c(0.140, 0.015))
	CI_dat %>% filter(n < 200) %>%
	ggplot() +
	geom_segment(aes(x = n, xend = n, y = L, yend = U)) +
	geom_hline(dat = true_dat, aes(yintercept = val), col = "blue") +
	facet_wrap(~ term, ncol = 1, scales = "free") +
	theme_minimal() +
	labs(x = "Iteration", y = "Confidence interval")
	```

	## Confidence interval coverage

	```{r, echo = TRUE}
	bind_rows(lots_of_CIs) %>%
	left_join(true_dat, by = "term") %>%
	mutate(covered = L < val & val < U) %>%
	group_by(term) %>%
	summarise(coverage = mean(covered))
	```

	```{r, eval = FALSE, include = FALSE}
	library(sandwich)
	robust_quasibinomial_CI <- function(dat, level = 0.95) {
	glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats), data = dat,
	family = quasibinomial(link = "identity"))
	b <- coef(glm_fit)
	se <- sqrt(diag(vcovHC(glm_fit)))
	crit <- qnorm(1 - (1 - level) / 2)
	data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
	}

	lots_of_robust_CIs <-
	replicate(2000, {
	dat <- simulate_batting_avgs(players = 400,
	beta = c(0.140, 0.015),
	sigma = 1 / 500)
	robust_quasibinomial_CI(dat)
	}, simplify = FALSE)

	bind_rows(lots_of_robust_CIs) %>%
	left_join(true_dat, by = "term") %>%
	mutate(covered = L < val & val < U) %>%
	group_by(term) %>%
	summarise(coverage = mean(covered))

	```

	## Simulation studies in formal research

	Questions about quantitative methodology:

	> * Which type of confidence intervals should be used for the beta-binomial model?
	> * Which of these twelve tests should I use for one-way ANOVA when the variances are non-homogeneous?
	> * How big a sample is needed to get accurate estimates of variance components in a multi-level logistic regression model?
	> * Is it reasonable to use a multivariate normal model to impute missing data, even though the variables look skewed?

	## Why focus on simulation studies?

	Few alternatives for assessing

	>* small-sample performance of estimation methods
	>* performance of combinations of methods (data analysis pipelines)
	>* robustness under model mis-specification
	>* comparison of competing methods

	## Simulation design

	```{r, fig.width = 10, fig.height = 6, out.width = "800px"}
	library(diagram)
	par(mar = c(0.1, 0.1, 0.1, 0.1))
	openplotmat()
	elpos <- coordinates(c(2,1,2))
	fromto <- matrix(ncol = 2, byrow = TRUE,
	data = c(4, 1, 1, 2, 2, 3, 3, 5))
	nr <- nrow(fromto)
	arrpos <- matrix(ncol = 2, nrow = nr)
	for (i in 1:nr) {
	arrpos[i, ] <- straightarrow(from = elpos[fromto[i, 1], ],
	to = elpos[fromto[i, 2], ],
	lwd = 2, arr.pos = 0.6, arr.length = 0.5)
	}

	box_dat <- data_frame(lab = c("Data-generating model",
	"Estimation methods",
	"Performance criteria",
	"Experimental design",
	"Results"),
	col = c("lightblue","lightgreen","yellow","orange","red"))

	for (i in 1:nrow(box_dat)) {
	textrect(elpos[i,], 0.2, 0.1, lab = box_dat$lab[i],
	box.col = box_dat$col[i], shadow.size = 0, cex = 2)
	}

	```

	## Simulation design strategy

	* Write separate functions for each component of the simulation
	* Makes it easier to debug, modify, or re-use code
	* Test each component
	* Run in parallel where possible

	## Learning more

	* Spring, 2017: Data Analysis, Simulation, & Programming in R
	* My blog: http://jepusto.github.io/
	* code for today's examples
	* lots of other examples
	* [another lecture on designing simulations](http://jepusto.github.io/Designing-simulation-studies-using-R)
	* Ask faculty for articles with good simulation studies
	* November 9th QM colloquium: Anita Israni on "Running Simulations on the Texas Advanced Computing Cluster"