---
title: "Designing simulation studies in R"
author: "James E. Pustejovsky"
date: "September 28, 2016"
output:
  ioslides_presentation:
    css: custom.css
    widescreen: true
    transition: faster
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.width = 8, fig.height = 5)
```
## Simulation
>1. Generate data from a __probability model__ where you know the true parameters.
>2. Apply one or more estimation procedures to the data, record the results.
>3. Repeat many times ("in repeated samples...").
>4. Compare the results to the truth (bias, accuracy, error/coverage rates), as in the toy sketch below.
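In miniature, these four steps might look like the following (a toy sketch that estimates a normal mean; not part of the batting-average example developed next):

```{r, echo = TRUE, eval = FALSE}
# Toy sketch of the simulation loop: estimating the mean of a normal distribution
true_mean <- 0.5
one_rep <- function() {
  y <- rnorm(25, mean = true_mean, sd = 1)   # 1. generate data from a known model
  mean(y)                                    # 2. apply an estimation procedure
}
estimates <- replicate(1000, one_rep())      # 3. repeat many times
mean(estimates) - true_mean                  # 4. compare to the truth (here, bias)
```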
## Why is simulation useful?
>* Building understanding & intuition about statistical models
>* Assessing/checking modeling strategies
>* Formal research
## Modeling batting averages
h/t David Robinson for this example (more details [on his blog](http://varianceexplained.org/r/beta_binomial_baseball/))
<div class="columns-2">
<img src="babe-ruth.jpg" height="400px" />
>* Baseball batting average = # hits / # at-bats
>* Correlated with # at-bats because better hitters get more opportunities
>* Beta-binomial regression is a useful way to model this relationship
</div>
## Batting averages vs. at-bats
```{r}
library(dplyr)
library(tidyr)
library(Lahman)
library(ggplot2)
# Players who pitched in more than three games (to be excluded)
pitchers <- Pitching %>%
  group_by(playerID) %>%
  summarize(gamesPitched = sum(G)) %>%
  filter(gamesPitched > 3)

# Career hits, at-bats, and batting average for non-pitchers
career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(pitchers, by = "playerID") %>%
  group_by(playerID) %>%
  summarize_each(funs(sum), H, AB) %>%
  mutate(average = H / AB)

# Add player names from the Master table
career <- Master %>%
  tbl_df() %>%
  select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID")

career %>%
  filter(AB >= 20) %>%
  ggplot(aes(AB, average)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  scale_x_log10() + theme_minimal() +
  labs(x = "Career at-bats", y = "Career batting average")
```
```{r, eval = FALSE, include = FALSE}
BA_betabin <- gamlss(cbind(H, AB - H) ~ log(AB),
                     data = career, family = BB(mu.link = "identity", sigma.link = "inverse"))
summary(BA_betabin)
```
Source: [Lahman's Baseball Database, 1871-2015](http://www.seanlahman.com/baseball-archive/statistics/) (R package `Lahman`)
## Beta-binomial regression
>* $Y_i$ is number of hits by player $i$
>* $n_i$ is number of at-bats
>* $Y_i \sim Binomial(n_i, \pi_i)$, where $\pi_i$ is the true batting ability of player $i$
>* $\pi_i \sim Beta\left(\mu_{n_i} / \sigma, (1 - \mu_{n_i}) / \sigma\right)$, where $\mu_{n_i}$ is the average batting ability of players with $n_i$ at-bats and $\sigma$ describes the variability in true ability (quantified below).
>* $\mu_{n_i} = \beta_0 + \beta_1 \log(n_i)$
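Under this parameterization the two shape parameters sum to $1 / \sigma$, so true ability has mean and variance

$$E(\pi_i) = \mu_{n_i}, \qquad Var(\pi_i) = \frac{\mu_{n_i}(1 - \mu_{n_i})}{1/\sigma + 1} = \frac{\sigma \, \mu_{n_i}(1 - \mu_{n_i})}{1 + \sigma},$$

so larger values of $\sigma$ correspond to greater spread in true ability around the regression line.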
## Simulate to build understanding
```{r}
career_at_bats <- career$AB
```
```{r, echo = TRUE}
set.seed(20160928)
players <- 1000
at_bats <- sample(career_at_bats, size = players)
summary(at_bats)
mu <- 0.140 + 0.015 * log(at_bats)
summary(mu)
```
## Simulate true abilities
```{r, echo = TRUE}
sigma <- 1 / 500
ability <- rbeta(players,
                 shape1 = mu / sigma,
                 shape2 = (1 - mu) / sigma)
```
## Simulate true abilities
```{r}
dat <- data_frame(at_bats, mu, ability)
ggplot(dat, aes(at_bats, ability)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  scale_x_log10() + theme_minimal() +
  ylim(0, 0.6) +
  labs(x = "Career at-bats", y = "Batting ability")
```
## Simulate batting averages
```{r, echo = TRUE}
dat$hits <- with(dat, rbinom(n = players, size = at_bats, prob = ability))
dat$batting_avg <- with(dat, hits / at_bats)
```
## Simulate batting averages
```{r}
ggplot(dat, aes(at_bats, batting_avg)) +
  geom_point(alpha = 0.3, color = "purple") +
  geom_smooth() +
  scale_x_log10() + theme_minimal() +
  ylim(0, 0.6) +
  labs(x = "Career at-bats", y = "Observed batting average")
```
## Fit the beta-binomial regression
```{r, echo = TRUE, results = "hide"}
library(gamlss)
bb_fit <- gamlss(cbind(hits, at_bats - hits) ~ log(at_bats),
                 data = dat, family = BB(mu.link = "identity"))
coef(bb_fit)
```
```{r}
coef(bb_fit)
```
## Simulate to check modeling strategies
>* In real-world data analysis, there are __almost always__ multiple ways to approach a problem.
>* Small simulations are a useful way to test out strategies for use in a given setting.
>* For modeling batting averages, beta-binomial regression is useful but __SLOW__.
>* Would it work to use a __quasi-binomial glm__ instead?
## Data-generating function
```{r, echo = TRUE}
simulate_batting_avgs <- function(players, beta, sigma) {
  # Sample career at-bat totals, draw true abilities from the beta
  # distribution, then draw hit counts from the binomial distribution
  at_bats <- sample(career_at_bats, size = players)
  mu <- beta[1] + beta[2] * log(at_bats)
  ability <- rbeta(players,
                   shape1 = mu / sigma,
                   shape2 = (1 - mu) / sigma)
  hits <- rbinom(n = players, size = at_bats, prob = ability)
  data_frame(at_bats, hits)
}

new_dat <- simulate_batting_avgs(players = 400,
                                 beta = c(0.140, 0.015),
                                 sigma = 1 / 500)
```
## Data-generating function
```{r, echo = TRUE}
new_dat
```
## Modeling function
```{r, echo = TRUE}
quasibinomial_CI <- function(dat, level = 0.95) {
  # Fit a quasi-binomial GLM with identity link and return Wald
  # confidence intervals for the regression coefficients
  glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats), data = dat,
                 family = quasibinomial(link = "identity"))
  b <- coef(glm_fit)
  se <- sqrt(diag(vcov(glm_fit)))
  crit <- qnorm(1 - (1 - level) / 2)
  data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
}
quasibinomial_CI(new_dat)
```
## Put them together!
```{r, echo = TRUE}
lots_of_CIs <-
  replicate(2000, {
    dat <- simulate_batting_avgs(players = 400, beta = c(0.140, 0.015), sigma = 1 / 500)
    quasibinomial_CI(dat)
  }, simplify = FALSE)
```
## Confidence interval coverage
```{r}
CI_dat <-
  bind_rows(lots_of_CIs) %>%
  group_by(term) %>%
  mutate(n = row_number())

true_dat <- data_frame(term = c("(Intercept)", "log(at_bats)"), val = c(0.140, 0.015))

CI_dat %>%
  filter(n < 200) %>%
  ggplot() +
  geom_segment(aes(x = n, xend = n, y = L, yend = U)) +
  geom_hline(data = true_dat, aes(yintercept = val), col = "blue") +
  facet_wrap(~ term, ncol = 1, scales = "free") +
  theme_minimal() +
  labs(x = "Iteration", y = "Confidence interval")
```
## Confidence interval coverage
```{r, echo = TRUE}
bind_rows(lots_of_CIs) %>%
  left_join(true_dat, by = "term") %>%
  mutate(covered = L < val & val < U) %>%
  group_by(term) %>%
  summarise(coverage = mean(covered))
```
```{r, eval = FALSE, include = FALSE}
library(sandwich)
robust_quasibinomial_CI <- function(dat, level = 0.95) {
  glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats), data = dat,
                 family = quasibinomial(link = "identity"))
  b <- coef(glm_fit)
  se <- sqrt(diag(vcovHC(glm_fit)))
  crit <- qnorm(1 - (1 - level) / 2)
  data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
}

lots_of_robust_CIs <-
  replicate(2000, {
    dat <- simulate_batting_avgs(players = 400,
                                 beta = c(0.140, 0.015),
                                 sigma = 1 / 500)
    robust_quasibinomial_CI(dat)
  }, simplify = FALSE)

bind_rows(lots_of_robust_CIs) %>%
  left_join(true_dat, by = "term") %>%
  mutate(covered = L < val & val < U) %>%
  group_by(term) %>%
  summarise(coverage = mean(covered))
```
## Simulation studies in formal research
Questions about quantitative methodology:
> * Which type of confidence intervals should be used for the beta-binomial model?
> * Which of these twelve tests should I use for one-way ANOVA when the variances are non-homogeneous?
> * How big a sample is needed to get accurate estimates of variance components in a multi-level logistic regression model?
> * Is it reasonable to use a multivariate normal model to impute missing data, even though the variables look skewed?
## Why focus on simulation studies?
Few alternatives for assessing
>* small-sample performance of estimation methods
>* performance of combinations of methods (data analysis pipelines)
>* robustness under model mis-specification
>* comparison of competing methods
## Simulation design
```{r, fig.width = 10, fig.height = 6, out.width = "800px"}
library(diagram)
par(mar = c(0.1, 0.1, 0.1, 0.1))
openplotmat()
elpos <- coordinates(c(2,1,2))
fromto <- matrix(ncol = 2, byrow = TRUE,
                 data = c(4, 1, 1, 2, 2, 3, 3, 5))
nr <- nrow(fromto)
arrpos <- matrix(ncol = 2, nrow = nr)
for (i in 1:nr) {
  arrpos[i, ] <- straightarrow(from = elpos[fromto[i, 1], ],
                               to = elpos[fromto[i, 2], ],
                               lwd = 2, arr.pos = 0.6, arr.length = 0.5)
}
box_dat <- data_frame(lab = c("Data-generating model",
                              "Estimation methods",
                              "Performance criteria",
                              "Experimental design",
                              "Results"),
                      col = c("lightblue", "lightgreen", "yellow", "orange", "red"))
for (i in 1:nrow(box_dat)) {
  textrect(elpos[i, ], 0.2, 0.1, lab = box_dat$lab[i],
           box.col = box_dat$col[i], shadow.size = 0, cex = 2)
}
```
## Simulation design strategy
* Write separate functions for each component of the simulation
    * Makes it easier to debug, modify, or re-use code
* Test each component
* Run in parallel where possible (a sketch follows below)
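For instance, the replication step could be spread across cores with `parallel::mclapply`. The sketch below is one possible approach (the `run_sim()` wrapper is hypothetical; it reuses `simulate_batting_avgs()` and `quasibinomial_CI()` from earlier, and `mclapply` relies on forking, so it runs on Mac/Linux but not Windows):

```{r, echo = TRUE, eval = FALSE}
library(parallel)

# Hypothetical wrapper: simulate and fit `reps` times, spread across `cores`
run_sim <- function(reps, players, beta, sigma, cores = 4L) {
  RNGkind("L'Ecuyer-CMRG")  # RNG stream suited to reproducible parallel runs
  sim_list <- mclapply(seq_len(reps), function(i) {
    dat <- simulate_batting_avgs(players = players, beta = beta, sigma = sigma)
    quasibinomial_CI(dat)
  }, mc.cores = cores)
  bind_rows(sim_list)
}

parallel_CIs <- run_sim(reps = 2000, players = 400,
                        beta = c(0.140, 0.015), sigma = 1 / 500)
```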
## Learning more
* Spring 2017: Data Analysis, Simulation, & Programming in R
* My blog: http://jepusto.github.io/
    * code for today's examples
    * lots of other examples
    * [another lecture on designing simulations](http://jepusto.github.io/Designing-simulation-studies-using-R)
* Ask faculty for articles with good simulation studies
* November 9th QM colloquium: Anita Israni on "Running Simulations on the Texas Advanced Computing Cluster"