This analysis is a follow-up to @benbbaldwin's post The accumulation of QB hits vs passing efficiency where his conclusion was that:
[For] a given level of hits, there is no evidence that the accumulation of hits makes any difference throughout the course of a game.
As the title suggests, this analysis will take a different approach to measuring the hypothesized effect.
Here is the code to get the passing data we'll be using from nflscrapR-data:
library(tidyverse)
library(dplyr)
library(MLmetrics)
library(lme4)
passes <- data.frame()
for (year in c(2009:2018)){
pbp_path <- paste0(
'https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_',
year,
'.csv'
)
season_passes <- read_csv(url(pbp_path)) %>%
filter(!is.na(epa), !is.na(ep)) %>%
mutate(
pass = ifelse(
play_type == 'pass'
| (play_type %in% c('no_play', 'run') & str_detect(desc, "((?<!Backwards )pass)|(sacked)|(scramble)")),
1, 0)
) %>%
filter(pass == 1, !is.na(down)) %>%
mutate(season = year) %>%
select(season, game_id, game_date, game_seconds_remaining, game_half, half_seconds_remaining,
score_differential, play_id, posteam, posteam_type, defteam, play_type, passer_player_id,
passer_player_name, qb_hit, sack, down, ydstogo, yardline_100, ep, epa, desc)
passes <- rbind(passes, season_passes)
rm(season_passes)
}
Ben does his analysis at the team level but this analysis is going to be at the player level. As such, we want as little missing data in the passer_player_id
column as possible.
nrow(passes %>% filter(is.na(passer_player_id)))/nrow(passes)
# [1] 0.08276332
Out of the box, about 8% of the passing plays in the dataset do not have an associated passer_player_id
, however, most of the play descriptions mention a passer. So, it's just a matter of parsing the names from the descriptions.
To do this, we will borrow from Ben's Basic nflscrapR tutorial the regex string used to extract the passer name from the play descriptions.
passer_name_re <- paste0(
'(?<=\\s)',
'[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?',
'(?=\\s((pass)|(sack)|(scramble)))'
)
passes <- passes %>%
mutate(
missing_passer_name = ifelse(
is.na(passer_player_name),
str_extract(desc, passer_name_re),
NA_character_
)
)
passer_ids <- passes %>%
filter(!is.na(passer_player_id)) %>%
distinct(season, posteam, passer_player_name, passer_player_id) %>%
rename(
missing_passer_name = passer_player_name,
missing_passer_id = passer_player_id
)
passes <- passes %>%
left_join(passer_ids, by = c('season', 'posteam', 'missing_passer_name')) %>%
mutate(
passer_player_name = ifelse(!is.na(passer_player_name), passer_player_name, missing_passer_name),
passer_player_id = ifelse(!is.na(passer_player_id), passer_player_id, missing_passer_id)
) %>%
select(-missing_passer_name, -missing_passer_id)
Once we get down to an acceptable number, we can just filter the missing plays out.
nrow(passes %>% filter(is.na(passer_player_id)))/nrow(passes)
[1] 0.0006718434
passes <- passes %>%
filter(!is.na(passer_player_id))
Finally, we'll need to create the cumulative hit variables. I also want to look at cumulative sacks and non-sack hits, so we'll create accumulated variables for them, too.
passes <- passes %>%
arrange(game_id, passer_player_id, play_id) %>%
group_by(game_id, passer_player_id) %>%
mutate(
cum_hits = cumsum(qb_hit),
cum_non_sack_hits = cumsum(ifelse(sack == 0 & qb_hit == 1, 1, 0)),
cum_sacks = cumsum(sack)
) %>%
ungroup() %>%
mutate(
cum_hits = cum_hits - ifelse(qb_hit == 1, 1, 0),
cum_sacks = cum_sacks - ifelse(sack == 1, 1, 0),
cum_non_sack_hits = cum_non_sack_hits - ifelse(sack == 0 & qb_hit == 1, 1, 0)
)
In Ben's analysis, he showed the relationship between mean_epa
and total_hits
at a game level:
# total_hits mean_epa games
# <int> <dbl> <int>
# 1 0 0.256 120
# 2 1 0.214 354
# 3 2 0.166 596
# 4 3 0.112 762
# 5 4 0.0758 771
# 6 5 0.0353 754
# 7 6 0.0202 557
# 8 7 -0.00527 428
# 9 8 -0.0383 298
# 10 9 -0.0681 192
# 11 10 -0.0463 116
# 12 11 -0.0795 81
# 13 12 -0.162 38
# 14 13 -0.124 30
# 15 14 -0.0417 11
# 16 15 -0.242 6
# 17 16 -0.184 4
He goes on to say:
This is picking up, in part, a game script effect, where overmatched teams fall behind early and are forced to pass a lot, resulting in their QB being hit more often...We want to create a level playing field.
In our analysis, we will certainly want to account for the fact that more passes result in more (opportunities for) hits. Ben also mentions teams that are overmatched, so maybe we can account for team offensive and defensive strengths, as well.
In order to account for "more passes = more hits", Ben aggregates team-game performances by the total number of hits. My approach will utilize regression models where individual passes are the observations, the resulting epa values are the target, and aggregation is, therefore, unnecessary.
Let's build a base model using the context in the play-by-play. In order to test the validity of the model, we'll split our dataset into train
and test
sets.
smp_size <- floor(0.65 * nrow(passes))
set.seed(69)
train_ind <- sample(seq_len(nrow(passes)), size = smp_size)
train <- passes[train_ind, ]
test <- passes[-train_ind, ]
base_formula <- epa ~ posteam_type + down_fac*ydstogo + yardline_100 + ep
base_fit <- lm(base_formula, train)
summary(base_fit)
# Call:
# lm(formula = base_formula, data = train)
#
# Residuals:
# Min 1Q Median 3Q Max
# -12.5282 -0.8374 -0.2543 0.9490 9.2732
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.4120844 0.0718716 19.647 < 2e-16 ***
# posteam_typehome 0.0429638 0.0089117 4.821 1.43e-06 ***
# down_fac2 -0.0589676 0.0388494 -1.518 0.1291
# down_fac3 -0.2761082 0.0390585 -7.069 1.57e-12 ***
# down_fac4 -0.3774475 0.0692393 -5.451 5.01e-08 ***
# ydstogo -0.0158859 0.0033357 -4.762 1.92e-06 ***
# yardline_100 -0.0129320 0.0006947 -18.616 < 2e-16 ***
# ep -0.2252849 0.0104915 -21.473 < 2e-16 ***
# down_fac2:ydstogo -0.0060017 0.0037805 -1.588 0.1124
# down_fac3:ydstogo -0.0076389 0.0037465 -2.039 0.0415 *
# down_fac4:ydstogo -0.0518295 0.0076304 -6.793 1.11e-11 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.625 on 133070 degrees of freedom
# Multiple R-squared: 0.005654, Adjusted R-squared: 0.005579
# F-statistic: 75.66 on 10 and 133070 DF, p-value: < 2.2e-16
R2_Score(predict(base_fit, test), test$epa)
# [1] 0.005795776
While the overall R2 obviously is not great, it is still better than nothing. Now we can test our accumulated hit variables. Starting with cum_hits
:
hits_formula <- epa ~ posteam_type + down_fac*ydstogo + yardline_100 + ep + cum_hits
hits_fit <- lm(hits_formula, train)
summary(hits_fit)
# Call:
# lm(formula = hits_formula, data = train)
#
# Residuals:
# Min 1Q Median 3Q Max
# -12.5375 -0.8386 -0.2542 0.9479 9.2702
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.4376849 0.0725113 19.827 < 2e-16 ***
# posteam_typehome 0.0415764 0.0089268 4.658 3.20e-06 ***
# down_fac2 -0.0606015 0.0388534 -1.560 0.11882
# down_fac3 -0.2798543 0.0390830 -7.161 8.08e-13 ***
# down_fac4 -0.3781659 0.0692383 -5.462 4.72e-08 ***
# ydstogo -0.0159250 0.0033356 -4.774 1.81e-06 ***
# yardline_100 -0.0130787 0.0006969 -18.768 < 2e-16 ***
# ep -0.2272435 0.0105170 -21.607 < 2e-16 ***
# cum_hits -0.0052878 0.0019876 -2.660 0.00781 **
# down_fac2:ydstogo -0.0059233 0.0037806 -1.567 0.11717
# down_fac3:ydstogo -0.0074938 0.0037468 -2.000 0.04550 *
# down_fac4:ydstogo -0.0511072 0.0076350 -6.694 2.18e-11 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.625 on 133069 degrees of freedom
# Multiple R-squared: 0.005707, Adjusted R-squared: 0.005625
# F-statistic: 69.43 on 11 and 133069 DF, p-value: < 2.2e-16
R2_Score(predict(hits_fit, test), test$epa)
# [1] 0.005863101
Including cum_hits
in the model does give a better fit. Next let's try cum_sacks
:
sacks_formula <- epa ~ posteam_type + down_fac*ydstogo + yardline_100 + ep + cum_sacks
sacks_fit <- lm(sacks_formula, train)
summary(sacks_fit)
# Call:
# lm(formula = sacks_formula, data = train)
#
# Residuals:
# Min 1Q Median 3Q Max
# -12.5371 -0.8390 -0.2541 0.9480 9.2584
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.4382681 0.0722595 19.904 < 2e-16 ***
# posteam_typehome 0.0420994 0.0089148 4.722 2.33e-06 ***
# down_fac2 -0.0615970 0.0388551 -1.585 0.112900
# down_fac3 -0.2813389 0.0390856 -7.198 6.14e-13 ***
# down_fac4 -0.3789212 0.0692377 -5.473 4.44e-08 ***
# ydstogo -0.0159563 0.0033356 -4.784 1.72e-06 ***
# yardline_100 -0.0130707 0.0006958 -18.785 < 2e-16 ***
# ep -0.2272693 0.0105064 -21.631 < 2e-16 ***
# cum_sacks -0.0119597 0.0034285 -3.488 0.000486 ***
# down_fac2:ydstogo -0.0058117 0.0037808 -1.537 0.124253
# down_fac3:ydstogo -0.0073112 0.0037475 -1.951 0.051065 .
# down_fac4:ydstogo -0.0507428 0.0076364 -6.645 3.05e-11 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.625 on 133069 degrees of freedom
# Multiple R-squared: 0.005745, Adjusted R-squared: 0.005663
# F-statistic: 69.9 on 11 and 133069 DF, p-value: < 2.2e-16
R2_Score(predict(sacks_fit, test), test$epa)
# [1] 0.005801378
While cum_sacks
does result in a higher R2 for the model, the R2 on the test set is slightly lower. Finally, let's try combining cum_sacks
and cum_non_sack_hits
:
comb_formula <- epa ~ posteam_type + down_fac*ydstogo + yardline_100 + ep + cum_sacks + cum_non_sack_hits
comb_fit <- lm(comb_formula, train)
summary(comb_fit)
# Call:
# lm(formula = comb_formula, data = train)
#
# Residuals:
# Min 1Q Median 3Q Max
# -12.5385 -0.8389 -0.2544 0.9477 9.2600
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.4417973 0.0725317 19.878 < 2e-16 ***
# posteam_typehome 0.0418159 0.0089290 4.683 2.83e-06 ***
# down_fac2 -0.0616384 0.0388553 -1.586 0.11266
# down_fac3 -0.2816011 0.0390885 -7.204 5.87e-13 ***
# down_fac4 -0.3789051 0.0692379 -5.473 4.44e-08 ***
# ydstogo -0.0159554 0.0033356 -4.783 1.73e-06 ***
# yardline_100 -0.0130930 0.0006969 -18.787 < 2e-16 ***
# ep -0.2275456 0.0105180 -21.634 < 2e-16 ***
# cum_sacks -0.0113891 0.0035755 -3.185 0.00145 **
# cum_non_sack_hits -0.0016574 0.0029474 -0.562 0.57389
# down_fac2:ydstogo -0.0058219 0.0037808 -1.540 0.12360
# down_fac3:ydstogo -0.0073241 0.0037476 -1.954 0.05066 .
# down_fac4:ydstogo -0.0507045 0.0076367 -6.640 3.16e-11 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.625 on 133068 degrees of freedom
# Multiple R-squared: 0.005747, Adjusted R-squared: 0.005658
# F-statistic: 64.1 on 12 and 133068 DF, p-value: < 2.2e-16
R2_Score(predict(comb_fit, test), test$epa)
# [1] 0.005818752
In addition to all of the R2 values being roughly the same as before, cum_non_sack_hits
has a fairly insignificant p-value. Of the three models that we've tested with accumulated hit variables, it is unclear to me which is best (or if any of them are actually better than the base model).
So, let's try adding some more context.
Toward the end of his analysis, Ben talks about the relationship between efficiency and number of hits:
The negative relationship between QB hits and efficiency is because the group of teams that get hit often are the only ones to make it to the high numbers of hits. Stated this way, it sounds obvious, but it's important. These teams aren't necessarily inefficient because their QBs are getting hit a lot; but rather, their QBs are getting hit a lot because they're bad teams to begin with.
Maybe we should try to account for just how good each offense and defense are. One way to do this is with linear mixed-effects models. If you are unfamiliar with these types of models, I would reccomend checking out the piece that @StatsByLopez wrote on estimating fumble frequencies during DeflateGate.
In our model(s), we treat all of the variables from our simple linear models as fixed effects and the offenses and defenses as random effects. Offenses will be split by season and QB while defenses will be split by season.
First, we'll build our offensive and defensive ids and resplit our dataset using the same seed as before:
passes <- passes %>%
mutate(
offense = as.factor(paste(season, passer_player_id, sep=':')),
defense = as.factor(paste(season, defteam, sep='_'))
)
smp_size <- floor(0.65 * nrow(passes))
set.seed(69)
train_ind <- sample(seq_len(nrow(passes)), size = smp_size)
train <- passes[train_ind, ]
test <- passes[-train_ind, ]
Now, we can build our mixed-effects models. Building a linear mixed-effects model with lme4
is very similar to building a linear model with base R. The only major difference is in the specification of the random effects terms.
Before we look at the accumulated hit variables again, we'll start with a base model:
base_formula_mm <- epa ~ posteam_type + down_fac*ydstogo + yardline_100 + ep + (1|offense) + (1|defense)
base_fit_mm <- lmer(base_formula_mm, train)
summary(base_fit_mm)
# Linear mixed model fit by REML ['lmerMod']
# Formula: epa ~ posteam_type + down_fac * ydstogo + yardline_100 + ep + (1 | offense) + (1 | defense)
# Data: train
#
# REML criterion at convergence: 506598
#
# Scaled residuals:
# Min 1Q Median 3Q Max
# -7.5730 -0.5203 -0.1509 0.5804 5.6803
#
# Random effects:
# Groups Name Variance Std.Dev.
# offense (Intercept) 0.016764 0.12948
# defense (Intercept) 0.004687 0.06846
# Residual 2.620479 1.61879
# Number of obs: 133081, groups: offense, 870; defense, 321
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 1.387130 0.072148 19.226
# posteam_typehome 0.046873 0.008936 5.245
# down_fac2 -0.064708 0.038772 -1.669
# down_fac3 -0.279516 0.038991 -7.169
# down_fac4 -0.372395 0.069123 -5.387
# ydstogo -0.016309 0.003329 -4.898
# yardline_100 -0.012859 0.000694 -18.529
# ep -0.229296 0.010485 -21.870
# down_fac2:ydstogo -0.005383 0.003772 -1.427
# down_fac3:ydstogo -0.006852 0.003739 -1.833
# down_fac4:ydstogo -0.048603 0.007617 -6.381
#
# Correlation of Fixed Effects:
# (Intr) pstm_t dwn_f2 dwn_f3 dwn_f4 ydstog yr_100 ep dwn_2: dwn_3:
# postm_typhm -0.056
# down_fac2 -0.498 -0.006
# down_fac3 -0.626 -0.005 0.801
# down_fac4 -0.558 0.003 0.472 0.525
# ydstogo -0.531 -0.006 0.866 0.878 0.512
# yardlin_100 -0.858 -0.001 0.078 0.221 0.358 0.058
# ep -0.869 -0.008 0.100 0.251 0.372 0.100 0.961
# dwn_fc2:yds 0.367 0.006 -0.950 -0.740 -0.407 -0.863 0.052 0.033
# dwn_fc3:yds 0.329 0.004 -0.752 -0.897 -0.393 -0.867 0.099 0.081 0.771
# dwn_fc4:yds 0.250 0.000 -0.378 -0.386 -0.767 -0.434 -0.052 -0.062 0.374 0.375
R2_Score(predict(base_fit_mm, test, allow.new.levels = TRUE), test$epa)
# [1] 0.008938265
While the summaries for lmer
objects don't provide R2 values for models, we can see the vast improvement in our test R2.
Once again, let's add in cum_hits
:
hits_formula_mm <- epa ~ posteam_type + down_fac*ydstogo + yardline_100 + ep + cum_hits +
(1|offense) + (1|defense)
hits_fit_mm <- lmer(hits_formula_mm, train)
summary(hits_fit_mm)
# Linear mixed model fit by REML ['lmerMod']
# Formula: epa ~ posteam_type + down_fac * ydstogo + yardline_100 + ep + cum_hits + (1 | offense) + (1 | defense)
# Data: train
#
# REML criterion at convergence: 506607
#
# Scaled residuals:
# Min 1Q Median 3Q Max
# -7.5760 -0.5205 -0.1510 0.5805 5.6804
#
# Random effects:
# Groups Name Variance Std.Dev.
# offense (Intercept) 0.016658 0.12907
# defense (Intercept) 0.004625 0.06801
# Residual 2.620538 1.61881
# Number of obs: 133081, groups: offense, 870; defense, 321
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 1.3995570 0.0728081 19.223
# posteam_typehome 0.0461532 0.0089537 5.155
# down_fac2 -0.0654363 0.0387764 -1.688
# down_fac3 -0.2812497 0.0390160 -7.209
# down_fac4 -0.3727959 0.0691236 -5.393
# ydstogo -0.0163275 0.0033296 -4.904
# yardline_100 -0.0129301 0.0006963 -18.571
# ep -0.2302172 0.0105107 -21.903
# cum_hits -0.0025855 0.0020533 -1.259
# down_fac2:ydstogo -0.0053506 0.0037725 -1.418
# down_fac3:ydstogo -0.0067897 0.0037394 -1.816
# down_fac4:ydstogo -0.0482605 0.0076223 -6.331
#
# Correlation of Fixed Effects:
# (Intr) pstm_t dwn_f2 dwn_f3 dwn_f4 ydstog yr_100 ep cm_hts dwn_2: dwn_3:
# postm_typhm -0.064
# down_fac2 -0.495 -0.005
# down_fac3 -0.625 -0.003 0.801
# down_fac4 -0.553 0.004 0.472 0.525
# ydstogo -0.527 -0.006 0.866 0.877 0.512
# yardlin_100 -0.859 0.004 0.079 0.223 0.357 0.058
# ep -0.869 -0.004 0.101 0.253 0.371 0.100 0.962
# cum_hits -0.135 0.063 0.015 0.036 0.005 0.005 0.081 0.071
# dwn_fc2:yds 0.364 0.005 -0.950 -0.740 -0.407 -0.863 0.051 0.032 -0.007
# dwn_fc3:yds 0.328 0.004 -0.752 -0.897 -0.393 -0.867 0.098 0.079 -0.014 0.771
# dwn_fc4:yds 0.253 -0.002 -0.379 -0.387 -0.767 -0.434 -0.055 -0.064 -0.037 0.374 0.375
R2_Score(predict(hits_fit_mm, test, allow.new.levels = TRUE), test$epa)
# [1] 0.008952836
Adding add cum_hits
does result in a slight improvement over our base mixed-effects model's test R2. Note that a t value of -1.259
with 133000 df is roughly equivalent to a p value of 0.2
.
Now for cum_sacks
:
sacks_formula_mm <- epa ~ posteam_type + down_fac*ydstogo + yardline_100 + ep + cum_sacks +
(1|offense) + (1|defense)
sacks_fit_mm <- lmer(sacks_formula_mm, train)
summary(sacks_fit_mm)
# Linear mixed model fit by REML ['lmerMod']
# Formula: epa ~ posteam_type + down_fac * ydstogo + yardline_100 + ep + cum_sacks + (1 | offense) + (1 | defense)
# Data: train
#
# REML criterion at convergence: 506605.2
#
# Scaled residuals:
# Min 1Q Median 3Q Max
# -7.5764 -0.5204 -0.1504 0.5804 5.6773
#
# Random effects:
# Groups Name Variance Std.Dev.
# offense (Intercept) 0.016586 0.12879
# defense (Intercept) 0.004606 0.06787
# Residual 2.620561 1.61881
# Number of obs: 133081, groups: offense, 870; defense, 321
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 1.3988302 0.0725516 19.280
# posteam_typehome 0.0464523 0.0089400 5.196
# down_fac2 -0.0657950 0.0387790 -1.697
# down_fac3 -0.2817349 0.0390198 -7.220
# down_fac4 -0.3730822 0.0691244 -5.397
# ydstogo -0.0163379 0.0033296 -4.907
# yardline_100 -0.0129211 0.0006952 -18.586
# ep -0.2301406 0.0105002 -21.918
# cum_sacks -0.0053061 0.0035276 -1.504
# down_fac2:ydstogo -0.0053076 0.0037728 -1.407
# down_fac3:ydstogo -0.0067216 0.0037401 -1.797
# down_fac4:ydstogo -0.0481505 0.0076236 -6.316
#
# Correlation of Fixed Effects:
# (Intr) pstm_t dwn_f2 dwn_f3 dwn_f4 ydstog yr_100 ep cm_sck dwn_2: dwn_3:
# postm_typhm -0.059
# down_fac2 -0.497 -0.005
# down_fac3 -0.626 -0.004 0.801
# down_fac4 -0.555 0.004 0.472 0.525
# ydstogo -0.529 -0.006 0.866 0.877 0.512
# yardlin_100 -0.858 0.001 0.079 0.223 0.358 0.058
# ep -0.869 -0.007 0.101 0.253 0.371 0.100 0.961
# cum_sacks -0.106 0.030 0.019 0.038 0.006 0.006 0.059 0.055
# dwn_fc2:yds 0.366 0.005 -0.950 -0.740 -0.407 -0.863 0.051 0.032 -0.014
# dwn_fc3:yds 0.329 0.004 -0.752 -0.897 -0.393 -0.867 0.097 0.079 -0.024 0.771
# dwn_fc4:yds 0.253 -0.001 -0.379 -0.387 -0.767 -0.434 -0.054 -0.064 -0.041 0.375 0.376
#
R2_Score(predict(sacks_fit_mm, test, allow.new.levels = TRUE), test$epa)
# [1] 0.008908527
The test R2 drops below what we were getting in our base model while the t value for cum_sacks
is roughly equivalent to a p value of 0.13
.
Let's finish off with cum_sacks
+ cum_non_sack_hits
:
comb_formula_mm <- epa ~ posteam_type + down_fac*ydstogo + yardline_100 + ep + cum_sacks +
cum_non_sack_hits + (1|offense) + (1|defense)
comb_fit_mm <- lmer(comb_formula_mm, train)
summary(comb_fit_mm)
# Linear mixed model fit by REML ['lmerMod']
# Formula: epa ~ posteam_type + down_fac * ydstogo + yardline_100 + ep + cum_sacks + cum_non_sack_hits + (1 | offense) + (1 | defense)
# Data: train
#
# REML criterion at convergence: 506614.7
#
# Scaled residuals:
# Min 1Q Median 3Q Max
# -7.5769 -0.5205 -0.1508 0.5805 5.6784
#
# Random effects:
# Groups Name Variance Std.Dev.
# offense (Intercept) 0.016585 0.12878
# defense (Intercept) 0.004601 0.06783
# Residual 2.620579 1.61882
# Number of obs: 133081, groups: offense, 870; defense, 321
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 1.4017879 0.0728271 19.248
# posteam_typehome 0.0461963 0.0089567 5.158
# down_fac2 -0.0658132 0.0387792 -1.697
# down_fac3 -0.2819329 0.0390222 -7.225
# down_fac4 -0.3730766 0.0691246 -5.397
# ydstogo -0.0163371 0.0033296 -4.907
# yardline_100 -0.0129399 0.0006963 -18.583
# ep -0.2303694 0.0105117 -21.916
# cum_sacks -0.0047892 0.0036979 -1.295
# cum_non_sack_hits -0.0014459 0.0030961 -0.467
# down_fac2:ydstogo -0.0053173 0.0037729 -1.409
# down_fac3:ydstogo -0.0067335 0.0037402 -1.800
# down_fac4:ydstogo -0.0481156 0.0076240 -6.311
#
# Correlation matrix not shown by default, as p = 13 > 12.
# Use print(x, correlation=TRUE) or
# vcov(x) if you need it
R2_Score(predict(comb_fit_mm, test, allow.new.levels = TRUE), test$epa)
# [1] 0.008923579
Again, no improvement on the test R2 from the base model and the t value for cum_non_sack_hits
is the lowest we've seen.
Ben reminded me on Twitter that I had neglected to include score_differntial
in the model(s). Here is what the results are when included in the simple linear models and the mixed models.
So, our best (most predictive) model was the mixed-effects model that included cum_hits
. The estimate for cum_hits
of -0.0026
meant that with every hit, each subsequent pass would drop by 0.0026
EPA.
As an example, take a QB that attempts 30 passes in a single game. If that QB was hit for the 3rd time on their 15th throw, assuming they are not hit again, the EPA lost on their last 15 throws as a result of accumulated hits would be 15 * 3 * -0.0026 ~= -0.116
.
Now, what if, in the worst case scenario, that same QB had been hit on every play and the EPA lost on their last 15 throws was instead 15 * 15 * -0.0026 ~= -0.62052
?
Considering that:
-0.116
is slightly better than the average 4 yard gain on 1st and 10-0.621
is slightly better than the average 1 yard gain on 2nd and 10
even in the worst-case scenario, the losses in EPA that result from an accumulation of hits is equivalent to one moderately bad play. That's it.
And that's only if you think cum_hits
should have been included in the model. Either way, nothing in mine or Ben's analyses suggest that accumulated hits are worth considering.
For fun, let's look at the random effects from our mixed-effects model(s). Here are the the distributions of the offensive and defensive units:
ran <- ranef(base_fit_mm)
plot_data <- rbind(
ran$offense %>% rename(strength='(Intercept)') %>% mutate(side='offense'),
ran$defense %>% rename(strength='(Intercept)') %>% mutate(side='defense')
)
library(ggplot2)
ggplot(plot_data, aes(strength, fill = side)) +
geom_density(alpha = 0.35) +
ggtitle('Offensive vs Defensive Passing EPA Strength')
The tighter distribution of the defensive strengths around zero suggest that defenses have less of an impact on passing EPA than offenses.
The top 15 offensive passing seasons since 2009 seem to generally make sense:
passers <- passes %>%
distinct(passer_player_id, passer_player_name) %>%
mutate(nchar=nchar(passer_player_name)) %>%
arrange(passer_player_id, -nchar) %>%
distinct(passer_player_id, .keep_all = TRUE) %>%
select(-nchar)
qb_str <- ran$offense %>%
rename(off_pass_epa_int = '(Intercept)') %>%
tibble::rownames_to_column(var = 'offense') %>%
mutate(off_pass_epa_var = attr(ran$offense, "postVar")[1, 1, ]) %>%
separate(offense, c('season', 'passer_player_id'), sep=':') %>%
mutate(season=as.numeric(season)) %>%
arrange(-off_pass_epa_int) %>%
left_join(passers, by='passer_player_id') %>%
left_join(
passes %>% group_by(season, passer_player_id) %>% summarize(passes=length(play_id)),
by = c('season', 'passer_player_id')
)
head(qb_str, 15)
# season passer_player_id off_pass_epa_int off_pass_epa_var passer_player_name passes
# 1 2018 00-0033873 0.3228594 0.004484434 P.Mahomes 671
# 2 2009 00-0010346 0.2973676 0.004809476 P.Manning 608
# 3 2014 00-0023459 0.2842181 0.004781794 A.Rodgers 605
# 4 2011 00-0020531 0.2779966 0.004270427 D.Brees 737
# 5 2018 00-0020531 0.2737248 0.005265880 D.Brees 539
# 6 2011 00-0023459 0.2696903 0.004807023 A.Rodgers 592
# 7 2016 00-0026143 0.2507002 0.004712305 M.Ryan 616
# 8 2013 00-0010346 0.2383985 0.004307639 P.Manning 716
# 9 2012 00-0019596 0.2243546 0.004396295 T.Brady 703
# 10 2011 00-0019596 0.2214261 0.004588874 T.Brady 687
# 11 2009 00-0022942 0.2189321 0.005354635 P.Rivers 542
# 12 2015 00-0029263 0.2171071 0.004994096 R.Wilson 602
# 13 2015 00-0022924 0.2169744 0.005290244 B.Roethlisberger 527
# 14 2012 00-0010346 0.2147981 0.004781978 P.Manning 638
# 15 2017 00-0032950 0.2140824 0.005416154 C.Wentz 525
And the top 10 defensive passing seasons belong to:
def_str <- ran$defense %>%
rename(def_pass_epa_int = '(Intercept)') %>%
tibble::rownames_to_column(var = 'defense') %>%
mutate(def_pass_epa_var = attr(ran$defense, "postVar")[1, 1, ]) %>%
separate(defense, c('season', 'defteam'), sep='_') %>%
mutate(season=as.numeric(season)) %>%
arrange(def_pass_epa_int)
head(def_str, 10)
# season defteam def_pass_epa_int def_pass_epa_var
# 1 2016 DEN -0.12672999 0.002695580
# 2 2009 NYJ -0.11572561 0.002947501
# 3 2012 CIN -0.11321756 0.002833850
# 4 2013 SEA -0.11027701 0.002831613
# 5 2012 ARI -0.10525198 0.002844016
# 6 2017 BAL -0.09783000 0.002721691
# 7 2011 NYJ -0.09685708 0.002879514
# 8 2017 JAX -0.09169130 0.002801606
# 9 2015 DEN -0.09158392 0.002712855
# 10 2009 BUF -0.08843870 0.002826532
And finally, the bottom 15 offensive passing seasons:
tail(qb_str, 15)
# season passer_player_id off_pass_epa_int off_pass_epa_var passer_player_name passes
# 856 2018 00-0026898 -0.1621003 0.014478714 M.Sanchez 43
# 857 2013 00-0027948 -0.1644682 0.011977040 B.Gabbert 107
# 858 2017 00-0028595 -0.1670066 0.015061625 S.Tolzien 24
# 859 2009 00-0026898 -0.1673196 0.006354349 M.Sanchez 414
# 860 2012 00-0027754 -0.1674245 0.008710545 J.Skelton 224
# 861 2018 00-0033958 -0.1836659 0.012063079 N.Peterman 105
# 862 2013 00-0030526 -0.1937594 0.006649194 E.Manuel 378
# 863 2015 00-0029567 -0.2034197 0.006527679 N.Foles 373
# 864 2018 00-0034343 -0.2044619 0.005956765 J.Rosen 475
# 865 2012 00-0029531 -0.2106621 0.009626361 R.Lindley 193
# 866 2016 00-0033106 -0.2213431 0.008055449 J.Goff 251
# 867 2009 00-0022164 -0.2309472 0.009012687 K.Boller 209
# 868 2010 00-0027659 -0.2539440 0.006828976 J.Clausen 354
# 869 2009 00-0025388 -0.2648288 0.007824201 J.Russell 299
# 870 2011 00-0025970 -0.2885444 0.010492451 C.Hanie (3rd QB) 133