Skip to content

Instantly share code, notes, and snippets.

@bayesball
Last active July 10, 2024 15:19
Show Gist options
  • Save bayesball/5877d9c2239f48189b0ed4c1dfe3167a to your computer and use it in GitHub Desktop.
Save bayesball/5877d9c2239f48189b0ed4c1dfe3167a to your computer and use it in GitHub Desktop.
Quarto file for Bayesian fitting of a random effects model for balls in play data with a bias component for the count.
---
title: "Count Effects Model"
format: html
editor: visual
---
Illustration of Bayesian fitting of a multilevel model for situational data using JAGS software.
Read in Statcast data from the first half of the 2023 season.
Reference: Albert, Jim (2024). Bayesian Workflow of a Situational Random Effects Model
https://bayesball.github.io/papers/bayes_workflow.html
Note: The JAGS software needs to be initially installed on one's laptop from
https://sourceforge.net/projects/mcmc-jags/files/
Outline:
- the multilevel Bayesian model
- data setup
- Tukey mean-difference plot of observed data
- simulating from posterior via JAGS
- posterior estimates of player abilities and bias effect
- posterior-predictive simulations
Common situations of interest:
- home versus away
- day versus night
- different counts
- facing pitcher of same or opposite arm
- facing pitchers of different qualities
- clutch versus non-clutch situations
My belief: Generally, situations tend to be *biases*.
- There is an effect due to the situation.
- But the size of the effect is the same for all hitters. In other words, there is little evidence that some hitters have extra ability to take advantage of the situation.
Demonstrate this belief by fitting of a "bias random effects" model for 2023 hitters on balls in play, and exploring the count advantage in hitting.
Have data on balls in-play for the 2023 season. A good measure of the quality of a ball in play is the expected wOBA statistic.
Let $y_j$ denote the square root of the expected wOBA for the $j$th ball in play.
Focus on the situations where the count is "neutral" and "behind" (ignore the "ahead" counts since relatively few in-play balls happen on ahead counts).
Model is $y_j \sim N(\mu_j, \sigma)$ where the mean has the structure $$
\mu_j = \beta_0 + \beta_1 * EFF_j + \alpha_{p(j)},
$$ where $EFF_j$ is the count effect, $\alpha_{p(j)}$ is the player effect for player $p(j)$, and the collection of player effects $\alpha_1, ..., \alpha_N$ are iid $N(0, \tau)$. At the last stage, the parameters $\sigma$ and $\tau$ are assigned weakly informative distributions.
Read in packages for this work. (These packages need to be initially installed into one's R system.)
```{r}
#| message: false
library(readr)
library(dplyr)
library(runjags)
library(coda)
library(ggplot2)
library(bayesplot)
library(purrr)
```
Read in 2023 in-play data before June 15. Focus on batters with at least 100 balls in play in this period
```{r}
#| message: false
sc_ip <- read_csv("https://raw.githubusercontent.com/bayesball/HomeRuns2021/main/sc2023_bip.csv")
```
```{r}
sc_regular_150a <- filter(sc_ip,
c_type %in% c("neutral", "behind")) |>
mutate(effect = ifelse(c_type == "neutral",
0.5, -0.5))
```
Histograms of the expected wOBA and the root data illustrating the usefulness of the root reexpression.
```{r}
ggplot(sc_regular_150a,
aes(estimated_woba_using_speedangle)) +
geom_histogram(bins = 20,
color = "white",
fill = "blue") -> p1
ggplot(sc_regular_150a,
aes(root_woba)) +
geom_histogram(bins = 20,
color = "white",
fill = "blue") -> p2
d1 <- p1$data |>
mutate(Value = estimated_woba_using_speedangle,
Type = "E(wOBA)")
d2 <- p2$data |>
mutate(Value = root_woba,
Type = "Root E(wOBA)")
ggplot(rbind(d1, d2)) +
geom_histogram(aes(Value),
bins = 20,
color = "white",
fill = "blue") +
facet_wrap(~ Type, ncol = 1) +
theme(text=element_text(size=18))
```
Using the function `mean_diff_plot()`, construct a Tukey mean-difference plot of the improvement in the Neutral count compared with the Behind count.
```{r}
mean_diff_plot <- function(sc_regular){
sc_regular %>%
group_by(player, c_type) %>%
summarize(N = n(),
M = mean(root_woba,
na.rm = TRUE),
.groups = "drop") -> S1
inner_join(filter(S1, c_type == "behind"),
filter(S1, c_type == "neutral"),
by = "player") -> S2
ggplot(S2, aes((M.x + M.y) / 2,
(M.y - M.x))) +
geom_point() +
geom_hline(yintercept = 0, color="red") +
xlab("Mean Root wOBA") +
ylab("Improvement") +
ggtitle("Improvement in Root wOBA in Neutral\n Compared to Behind Counts") +
theme(plot.title = element_text(colour = "blue", size = 18,
hjust = 0.5, vjust = 0.8, angle = 0),
plot.subtitle = element_text(colour = "red", size = 16,
hjust = 0.5, vjust = 0.8, angle = 0)) +
theme(text=element_text(size=18))
}
```
```{r}
(p1 <- mean_diff_plot(sc_regular_150a))
```
Illustration of the random effects model with a bias component:
```{r}
df <- data.frame(x1 = 0.53, x2 = 0.56, y1 = 4, y2 = 4)
ggplot(data = data.frame(x = c(.35, .7)), aes(x)) +
stat_function(fun = dnorm,
args = list(mean = 0.53, sd = 0.03)) +
stat_function(fun = dnorm,
args = list(mean = 0.56, sd = 0.03),
color = "red") +
geom_vline(xintercept = 0.53, linetype = "dashed") +
geom_vline(xintercept = 0.56, linetype = "dashed",
color = "red") +
theme(text=element_text(size=18)) +
labs(x = "Ability", y = "") +
annotate(geom = "text", x= .45, y = 10,
label = "BEHIND", size = 7) +
annotate(geom = "text", x= .65, y = 10,
label = "NEUTRAL", size = 7,
color = "red") +
geom_segment(data = df, aes(x = x1, y = y1, xend = x2,
yend = y2, colour = "segment"),
arrow = arrow(length = unit(0.03, "npc"),
ends = "both")) +
annotate(geom = "text", x= .545, y = 3,
label = "beta", parse = TRUE, size = 7,
color = "red") +
guides(color="none")
```
Fit a bias model with random effects by use of JAGS:
Define the Bayesian model in a script.
```{r}
modelString <-"
model {
## sampling
for (i in 1:N){
y[i] ~ dnorm(mu[i], invsigma2)
mu[i] <- beta * eff[i] + alpha_j[player[i]]
}
## priors
for (j in 1:J){
alpha_j[j] ~ dnorm(theta, invtau2)
}
theta ~ dnorm(0, 0.001)
beta ~ dnorm(0, 0.001)
invsigma2 <- 1 / (sigma * sigma)
sigma ~ dt(0, 1, 1) T(0, )
invtau2 <- 1 / (tau * tau)
tau ~ dt(0, 1, 1) T(0, )
}
"
```
Create a list `jags_data` including all of the data needed in the JAGS run.
```{r}
jags_data <- list(y = sc_regular_150a$root_woba,
eff = sc_regular_150a$effect,
player = sc_regular_150a$player,
N = dim(sc_regular_150a)[1],
J = max(sc_regular_150a$player))
```
Run JAGS with 1000 iterations at the adaption stage, 2000 iterations at the warmup stage, and collect 5000 iterations. Will collect all parameters $\alpha, \beta_0, \beta_1, \sigma, \tau$ in the model.
```{r}
#| message: false
posterior <- run.jags(modelString,
n.chains = 1,
data = jags_data,
monitor = c("theta", "beta", "alpha_j",
"sigma", "tau"),
adapt = 1000,
burnin = 2000,
sample = 5000)
```
Put the simulated draws in a data frame.
```{r}
posterior %>% as.mcmc() %>%
as.data.frame() -> post2
```
Plot of fitted random effects curve:
```{r}
ps <- post2[sample(5000, 20),c("theta", "tau")]
myplot <- ggplot(data.frame(x = c(.4, .7)), aes(x)) +
theme(text=element_text(size=18)) +
labs(x = expression(alpha), y = "Density")
for(j in 1:20){
myplot <- myplot +
stat_function(fun = dnorm,
args = list(mean = ps$theta[j],
sd = ps$tau[j]),
alpha = 0.35)
}
myplot
```
Several graphs comparing player estimates (observed and multilevel).
Jitter plot comparing the two sets of estimates.
```{r}
sc_regular_150a |>
group_by(player) |>
summarize(IP = n(),
Estimate = mean(root_woba)) |>
mutate(Type = "Observed") |>
select(player, Estimate, Type) -> S
get_post_est <- function(j){
mean(post2[, 2 + j])
}
S1 <- data.frame(player = 1:jags_data$J,
Estimate = sapply(1:jags_data$J,
get_post_est)) |>
mutate(Type = "Multilevel")
ggplot(rbind(S, S1), aes(Type, Estimate)) +
geom_jitter() +
coord_flip()
```
Shrinkage plot showing the shrinkage of the observed root wOBA estimates towards the multilevel estimates.
```{r}
shrinkage_plot <- function(S, N = 15, seed = 12){
# S has variables Player, Individual, Multilevel
require(tidyr)
set.seed(seed)
S2 <- pivot_longer(sample_n(S, size = N),
names_to = "Type",
values_to = "AVG", -Player)
S2a <- filter(S2, Type == "Individual")
ggplot(S2, aes(Type, AVG, group=Player)) +
geom_line() + geom_point() +
ggplot2::annotate("text", x = 0.75, y = S2a$AVG,
label = S2a$Player) +
ggtitle("Shrinkage Plot") +
theme(plot.title = element_text(
colour = "blue", size = 18, hjust = 0.5))
}
S |>
mutate(Player = player,
Individual = Estimate) |>
select(Player, Individual) -> H
S1 |>
mutate(Player = player,
Multilevel = Estimate) |>
select(Player, Multilevel) -> H1
H01 <- inner_join(H, H1, by = "Player")
shrinkage_plot(H01, N = 30, seed = 10)
```
Posterior of bias coefficient $\beta$:
```{r}
ggplot(post2, aes(beta)) +
geom_histogram(bins = 20, color = "white", fill = "tan") +
theme(text=element_text(size=18)) +
theme(plot.title = element_text(colour = "blue", size = 18,
hjust = 0.5, vjust = 0.8, angle = 0)) +
ggtitle("Posterior of Bias Coefficient")
```
Alternative display using `bayesplot` package:
```{r}
mcmc_areas(post2, "beta", prob = 0.90)
```
90% interval estimate for beta:
```{r}
quantile(post2[, "beta"], c(.05, .95))
```
The function `one_sim()` implements one simulation of replicated data from the posterior predictive distribution.
- `alpha` is a simulated draw of random effects parameters
- `mean0` is the simulated draw of $\mu_j$ for all players behind in the count
- `mean1` is the simulated draw of $\mu_j$ for all players in a neutral count
- the means `mean0` and `mean1` are merged with full dataset
- values of `y` are simulated from normal distributions
```{r}
one_sim <- function(j){
# simulates one replicated dataset
# simulated draws of random effect parameters
alpha <- unlist(post2[j, c(3:(jags_data$J + 2))])
# simulated player means when behind
mean0 <- data.frame(player = 1:jags_data$J,
mu = post2[j, "beta"] * (- 0.5) + alpha,
c_type = "behind")
# simulated player means when ahead
mean1 <- data.frame(player = 1:jags_data$J,
mu = post2[j, "beta"] * (0.5) + alpha,
c_type = "neutral")
mean01 <- rbind(mean0, mean1)
# merge player means with sc data
# simulate root woba from sampling model
inner_join(select(sc_regular_150a,
player, c_type), mean01,
by = c("player" = "player",
"c_type" = "c_type")) |>
mutate(root_woba = rnorm(jags_data$N, mu,
post2[j, "sigma"]))
}
```
The function `compare_plots()` shows mean-difference plots of the observed and one draw from posterior predictive distribution.
```{r}
compare_plots <- function(jj){
# compare observed and simulated pp m/d plots
p2 <- mean_diff_plot(one_sim(jj))
p1$data |> mutate(Type = "Observed") -> d1
p2$data |> mutate(Type = "Simulated") -> d2
ggplot(rbind(d1, d2), aes((M.x + M.y) / 2,
(M.y - M.x))) +
geom_point() +
geom_hline(yintercept = 0, color="red") +
facet_wrap(~ Type, ncol = 1) +
theme(text=element_text(size=18)) +
geom_hline(yintercept = 0, color="red") +
xlab("Root wOBA") +
ylab("Improvement") +
ggtitle("Improvement in Root wOBA in Neutral\n Compared to Behind Counts") +
theme(plot.title = element_text(colour = "blue", size = 18,
hjust = 0.5, vjust = 0.8, angle = 0),
plot.subtitle = element_text(colour = "red", size = 16,
hjust = 0.5, vjust = 0.8, angle = 0))
}
```
```{r}
compare_plots(sample(5000, 1))
```
The function test function implements a posterior-predictive check. This particular function checks if the simulated data resembles the observed data with respect to the variation in the {$(M_{neutral} + M_{behind}) /2$} values and the {$M_{neutral} - M_{behind}$} values.
The checking function is $(S_1, S_2)$ where
- $S_1 = SD((M_{neutral} + M_{behind}) / 2)$
- $S_2 = SD(M_{neutral} - M_{behind})$
```{r}
test_function <- function(jj){
p2 <- mean_diff_plot(one_sim(jj))
p2$data |>
summarize(S1 = sd((M.x + M.y) / 2),
S2 = sd(M.y - M.x))
}
```
The `test_function()` is run 500 times, collecting the values of $S_1, S_2$ from the posterior predictive distribution.
```{r}
OUT <- map(1:500, test_function) |> bind_rows()
```
The values of $S_1, S_2$ are found from the observed data.
```{r}
observed <- mean_diff_plot(sc_regular_150a)$data |>
summarize(S1 = sd((M.x + M.y) / 2),
S2 = sd(M.y - M.x))
```
The simulated draws are displayed as a scatterplot with the observed value shown as a big red point. Since the test function on the observed data is in the middle of the display, this indicates that the observed data (using this particular metric) is consistent with simulations from the fitted model.
```{r}
ggplot(OUT, aes(S1, S2)) +
geom_point() +
geom_point(data = observed, color = "red",
size = 5) +
xlab("SD(Mean)") +
ylab("SD(Diff)") +
ggtitle("Posterior Predictive Check") +
theme(plot.title = element_text(colour = "blue", size = 18,
hjust = 0.5, vjust = 0.8, angle = 0))
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment