@nflscrapR's Expected Points (EP) is a popular metric among analysts doing public research of play in the NFL. Detailed in the creators' research paper, the metric is derived from a model that was built as a part of a larger system designed to calculate individual wins above replacement values for offensive skill players.
The authors very graciously made public all of their data (nflscrapR-data) and code (nflWAR, nflscrapR-models, nflscrapR) for this project, including the code used to build the EP model. In the init_ep_fg_models.R file of the nflscrapR-models repository, we can see that the following variables are used as inputs in the model:
TimeSecs_Remaining
: time remaining until end of half
yrdline100
: yardline from 1-99, where 1 is the possessing team's own 1 yardline and 99 is their opponent's
down
: the current down
log_ydstogo
: log transform of yards to go
GoalToGo
: indicator for goal-to-go down and distances
Under_TwoMinute_Warning
: indicator for the two minute warning
Noticeably absent are variables for the possessing and defending teams' "strength," as well as an indicator for whether or not the possessing team was at home. When I asked why these variables were excluded, it was explained to me (via tweet):
[The model is] creating an average baseline for which EPA calculations can then be attributed to players in some manner (then you can adjust for home field at that point) - one could make an EP mode taking into account home, team strength etc - different purpose
Regarding a home-field advantage (HFA) variable, I am not sure I understand how it is different (in a way that would merit exclusion) from any of the variables that are included. In addition, while I do see the logic regarding the exclusion of team strength variables, I believe their exclusion could be leading to biases in the EP model.
As such, the point of this analysis is to compare the performance of the current model vs one that includes HFA and team strength estimates.
I did not have the proper permissions to access the nflscrapR-models repository in order to import the functions it uses to build the play-by-play (pbp) data, so I copied the necessary code from that repository and used it to create virtually identical functions in my own scripts. If you want to replicate this analysis, you'll need to create a file in your working directory named utils.R that is identical to the file here.
To begin, we'll use the get_ep_model_pbp_data function created in the utils.R file to import the data and create two new variables:
ActPts
: number of points of the next score in the half
pos_loc
: HFA indicator for the team possessing the ball
library(dplyr)   # for the pipe, mutate(), and case_when() (utils.R may already load this)
source('utils.R')   # provides get_ep_model_pbp_data(), copied from nflscrapR-models
pbp_ep_model_data <- get_ep_model_pbp_data() %>%
  mutate(
    # ActPts: value of the next score in the half, from the possessing team's perspective
    ActPts = case_when(
      Next_Score_Half == 'No_Score' ~ 0,
      Next_Score_Half == 'Opp_Safety' ~ -2,
      Next_Score_Half == 'Safety' ~ 2,
      Next_Score_Half == 'Opp_Field_Goal' ~ -3,
      Next_Score_Half == 'Field_Goal' ~ 3,
      Next_Score_Half == 'Opp_Touchdown' ~ -7,
      Next_Score_Half == 'Touchdown' ~ 7,
      TRUE ~ NA_real_
    ),
    # pos_loc: 1 if the possessing team is at home, -1 if away, 0 otherwise
    pos_loc = case_when(
      posteam == HomeTeam ~ 1,
      posteam == AwayTeam ~ -1,
      TRUE ~ 0
    )
  )
I'm not sure if we can directly access the EP model object that nflscrapR uses, but we can attempt to rebuild it from init_ep_fg_models.R:
# rebuild the nflscrapR EP model: a multinomial logistic regression on the next scoring event
ep_model <- nnet::multinom(
  Next_Score_Half ~ TimeSecs_Remaining + yrdline100 + down + log_ydstogo +
    GoalToGo + log_ydstogo*down + yrdline100*down + GoalToGo*log_ydstogo +
    Under_TwoMinute_Warning,
  data = pbp_ep_model_data,
  weights = Total_W_Scaled,
  maxit = 300
)
In order to evaluate how closely this EP model matches the one actually used, we'll compare the EP estimates already present in pbp_ep_model_data against the estimates our model generates. We will exclude field goals because the process for generating those estimates is a little more involved and they only represent about 3% of the dataset.
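A quick way to check that share (it should land around the ~3% mentioned above, depending on which seasons are in your copy of the data):
# share of plays that are field goal attempts, i.e. the plays we exclude below
pbp_ep_model_data %>%
  summarize(fg_share = mean(PlayType == 'Field Goal' | !is.na(FieldGoalResult), na.rm = TRUE))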
model_check_data <- pbp_ep_model_data %>%
filter(PlayType != 'Field Goal', is.na(FieldGoalResult))
# weight each predicted scoring probability by its point value to get an EP estimate
ep_model_pred <- data.frame(predict(ep_model, type = 'probs', model_check_data)) %>%
  mutate(
    our_ExpPts = (-2 * Opp_Safety) + (2 * Safety) + (-3 * Opp_Field_Goal) + (3 * Field_Goal) +
      (-7 * Opp_Touchdown) + (7 * Touchdown)
  )
model_check_data <- model_check_data %>%
bind_cols(ep_model_pred %>% select(our_ExpPts))
MLmetrics::R2_Score(model_check_data$our_ExpPts, model_check_data$ExpPts)
# [1] 0.9999999
So, yeah, our model is virtually the same as the one currently used.
There are many ways to build team strength estimates. Similar to my previous analysis on accumulated QB hits, let's use a linear mixed-effects regression (lmer) model.
First, we'll need data for the model. Let's aggregate a team's offensive and defensive EPA at a game level and create the factors for what will be our other input variables.
game_epa <- pbp_ep_model_data %>%
filter(!is.na(EPA)) %>%
group_by(Season, GameID, posteam, DefensiveTeam, pos_loc) %>%
summarize(mean_epa = mean(EPA)) %>%
mutate(
game = as.factor(GameID),
offense = as.factor(paste(Season, posteam, sep='_')),
defense = as.factor(paste(Season, DefensiveTeam, sep='_')),
pos_loc = as.factor(pos_loc)
)
So, here is our formula:
team_epa_formula <- mean_epa ~ pos_loc + (1|game) + (1|offense) + (1|defense)
offense and defense will estimate the strength of each team, while game will, sort of, capture the variance in EPA for each game. If otherwise prolific offenses underperform in a specific game (say, due to bad weather), game should capture some of that and "punish" the offense less for it.
Now, we build the model and extract the random effects (i.e., the team strengths) from the model.
library(lme4)    # for lmer() and ranef()
library(tidyr)   # for separate(), used when splitting the random effect names below
team_epa_fit <- lmer(team_epa_formula, data = game_epa)
ran <- ranef(team_epa_fit)
pos_str <- ran$offense %>%
rename(pos_epa_str = '(Intercept)') %>%
tibble::rownames_to_column(var = 'offense') %>%
separate(offense, c('Season', 'posteam'), sep='_') %>%
mutate(
Season = as.numeric(Season),
pos_epa_str = as.vector(scale(pos_epa_str))
) %>%
arrange(-pos_epa_str)
def_str <- ran$defense %>%
rename(def_epa_str = '(Intercept)') %>%
tibble::rownames_to_column(var = 'defense') %>%
separate(defense, c('Season', 'DefensiveTeam'), sep='_') %>%
mutate(
Season = as.numeric(Season),
def_epa_str = as.vector(scale(def_epa_str))
) %>%
arrange(def_epa_str)
Looking at the top and bottom offenses and defenses, I'd say these estimates do a decent job of capturing team strength.
head(pos_str, 10)
# Season posteam pos_epa_str
# 1 2016 ATL 2.830115
# 2 2011 GB 2.492116
# 3 2011 NO 2.431377
# 4 2013 DEN 2.296974
# 5 2011 NE 2.269525
# 6 2010 NE 2.204658
# 7 2012 NE 2.134231
# 8 2014 GB 1.812909
# 9 2013 SD 1.712631
# 10 2015 ARI 1.660256
tail(pos_str)
# Season posteam pos_epa_str
# 252 2009 DET -2.055390
# 253 2011 STL -2.096951
# 254 2009 CLE -2.142117
# 255 2009 TB -2.253331
# 256 2010 CAR -2.647988
# 257 2010 ARI -2.664356
head(def_str, 10)
# Season DefensiveTeam def_epa_str
# 1 2009 NYJ -3.406461
# 2 2010 CHI -2.462656
# 3 2012 CHI -2.240408
# 4 2013 SEA -2.132376
# 5 2010 PIT -2.127570
# 6 2010 SD -1.936222
# 7 2014 BUF -1.825720
# 8 2012 ARI -1.777001
# 9 2010 NYG -1.750598
# 10 2013 CIN -1.726135
tail(def_str)
# Season DefensiveTeam def_epa_str
# 252 2010 DEN 1.856413
# 253 2010 HOU 1.888760
# 254 2016 CLE 1.914799
# 255 2013 ATL 2.066987
# 256 2014 NO 2.127412
# 257 2015 NO 2.267538
Finally, let's append them to our dataset.
pbp_ep_model_data <- pbp_ep_model_data %>%
  # drop strength columns from any previous run so the joins below don't duplicate them
  select(-any_of(c('pos_epa_str', 'def_epa_str'))) %>%
  left_join(pos_str, by = c('Season', 'posteam')) %>%
  left_join(def_str, by = c('Season', 'DefensiveTeam'))
So, what do I mean when I talk about biases in an EP model that result from the exclusion of team strength variables? I'll illustrate with a chart.
Let's take all second downs (excluding goal-to-go situations) with 20 or fewer yards to go and plot the average offensive and defensive strengths at each distance:
library(ggplot2)
sec_dwn_data <- pbp_ep_model_data %>%
filter(down == 2, GoalToGo == 0, ydstogo <= 20, !is.na(pos_epa_str), !is.na(def_epa_str)) %>%
group_by(ydstogo) %>% summarize(
off_mean=mean(pos_epa_str),
def_mean=mean(def_epa_str)
)
sec_dwn_plot <- ggplot(sec_dwn_data, aes(x = ydstogo)) +
geom_point(aes(y = off_mean, colour = 'off')) +
geom_smooth(method = "lm", aes(y = off_mean, colour = 'off')) +
geom_point(aes(y = def_mean, colour = 'def')) +
geom_smooth(method = "lm", aes(y = def_mean, colour = 'def')) +
labs(
title = 'Mean Team Strength by Distance - 2nd Down',
x = 'Distance',
y = 'Mean Strength'
)
plot(sec_dwn_plot)
As the 2nd down distance increases, the values for the mean offensive and defensive strengths both get lower. In other words, the longer the distance, the worse the offense and the better the defense (recall that a lower def_epa_str means a better defense).
Does the trend hold for third and fourth down?
Yep! Notice also that the mean values themselves get lower on each successive down.
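The code for those plots isn't reproduced above, but a sketch along the lines of the 2nd down code, faceted by down rather than built as two separate plots (late_dwn_data and late_dwn_plot are just names I'm using here), would look something like this:
late_dwn_data <- pbp_ep_model_data %>%
  filter(down %in% c(3, 4), GoalToGo == 0, ydstogo <= 20,
         !is.na(pos_epa_str), !is.na(def_epa_str)) %>%
  group_by(down, ydstogo) %>%
  summarize(
    off_mean = mean(pos_epa_str),
    def_mean = mean(def_epa_str)
  )
late_dwn_plot <- ggplot(late_dwn_data, aes(x = ydstogo)) +
  geom_point(aes(y = off_mean, colour = 'off')) +
  geom_smooth(method = "lm", aes(y = off_mean, colour = 'off')) +
  geom_point(aes(y = def_mean, colour = 'def')) +
  geom_smooth(method = "lm", aes(y = def_mean, colour = 'def')) +
  facet_wrap(~ down) +
  labs(
    title = 'Mean Team Strength by Distance - 3rd & 4th Down',
    x = 'Distance',
    y = 'Mean Strength'
  )
plot(late_dwn_plot)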
One final plot of the mean team strengths for 1st and 10s by yardline:
yl_data <- pbp_ep_model_data %>%
filter(down == 1, ydstogo == 10, !is.na(pos_epa_str), !is.na(def_epa_str)) %>%
group_by(yrdline100) %>% summarize(
off_mean=mean(pos_epa_str),
def_mean=mean(def_epa_str)
)
yl_plot <- ggplot(yl_data, aes(x = yrdline100)) +
geom_point(aes(y = off_mean, colour = 'off')) +
geom_smooth(method = "lm", aes(y = off_mean, colour = 'off')) +
geom_point(aes(y = def_mean, colour = 'def')) +
geom_smooth(method = "lm", aes(y = def_mean, colour = 'def')) +
labs(
title = 'Mean Team Strength by Yardline - 1st & 10',
x = 'Yardline',
y = 'Mean Strength'
)
plot(yl_plot)
At this point, you might be thinking "Duh, better teams put themselves in more advantageous positions. Big deal."
Consider a hypothetical 3rd and 10 between an average strength offense and an average strength defense. In the pbp dataset used to train the model, the mean strengths of the offenses and defenses facing a 3rd and 10 are 0.0743 and -0.0893, values that are roughly equivalent to 46th percentile and 52nd percentile teams, respectively. In other words, a below average offense and an above average defense. However, because it is blind to team strength, the naive model will treat the average offense like it is below average and vice-versa for the defense, deflating its EP value. Conversely, in more offensively advantageous down and distances, it will inflate the EP.
At least, that is the hypothesis.
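For reference, a calculation along these lines (mirroring the aggregation used for the plots above; the exact filters here are my assumption) should land near those mean-strength figures:
# mean offensive and defensive strength across 3rd and 10 plays in the training data
pbp_ep_model_data %>%
  filter(down == 3, ydstogo == 10, GoalToGo == 0,
         !is.na(pos_epa_str), !is.na(def_epa_str)) %>%
  summarize(
    off_mean = mean(pos_epa_str),
    def_mean = mean(def_epa_str)
  )
# a value's percentile rank among team-seasons can then be read off with ecdf(),
# e.g. ecdf(pos_str$pos_epa_str), keeping in mind that lower def_epa_str is better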
For the informed model, we'll just add the team strengths (pos_epa_str and def_epa_str) and the HFA indicator (pos_loc) to the previous formula, with no interactions. Run the new model.
# informed model: the same formula plus the team strengths and HFA indicator, no new interactions
new_ep_model <- nnet::multinom(
  Next_Score_Half ~ TimeSecs_Remaining + yrdline100 + down + log_ydstogo +
    GoalToGo + log_ydstogo*down + yrdline100*down + GoalToGo*log_ydstogo +
    Under_TwoMinute_Warning + pos_epa_str + def_epa_str + pos_loc,
  data = pbp_ep_model_data,
  weights = Total_W_Scaled,
  maxit = 300
)
Now, let's see how it compares to our original (naive) model.
new_model_check_data <- pbp_ep_model_data %>%
filter(PlayType != 'Field Goal', is.na(FieldGoalResult))
new_ep_model_pred <- data.frame(predict(new_ep_model, type = 'probs', new_model_check_data)) %>%
mutate(
new_ExpPts = (-2 * Opp_Safety) + (2 * Safety) + (-3 * Opp_Field_Goal) + (3 * Field_Goal) +
(-7 * Opp_Touchdown) + (7 * Touchdown)
)
new_model_check_data <- new_model_check_data %>%
bind_cols(new_ep_model_pred %>% select(new_ExpPts))
MLmetrics::R2_Score(new_model_check_data$new_ExpPts, new_model_check_data$ExpPts)
# [1] 0.8348314
An R2 of 0.83 suggests there is some disagreement between the two models. If you want to visualize that disagreement:
mod_var <- ggplot(new_model_check_data, aes(x=new_ExpPts, y=ExpPts) ) +
geom_bin2d(bins = 100) +
scale_fill_continuous(type = "viridis") +
theme_bw()
plot(mod_var)
So we know it's different. Now, let's see if it is better by comparing the ActPts variable we created at the beginning to each model's EP predictions. We'll compare using the RMSE metric.
MLmetrics::RMSE(
new_model_check_data %>% filter(!is.na(ExpPts)) %>% pull(ExpPts),
new_model_check_data %>% filter(!is.na(ExpPts)) %>% pull(ActPts)
)
# [1] 4.688036
MLmetrics::RMSE(
new_model_check_data %>% filter(!is.na(new_ExpPts)) %>% pull(new_ExpPts),
new_model_check_data %>% filter(!is.na(new_ExpPts)) %>% pull(ActPts)
)
# [1] 4.625228
A lower RMSE indicates a better fit for the informed model.
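To put the difference in context, it works out to roughly a 1.3% reduction in RMSE:
# relative improvement in RMSE for the informed model, from the two values above
(4.688036 - 4.625228) / 4.688036
# ~0.0134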
So, we have a better model, but there is a problem with using it before making any other changes. Recall @nflscrapR's reasoning for leaving out team strengths in the first place:
[The model is] creating an average baseline for which EPA calculations can then be attributed to players in some manner (then you can adjust for home field at that point) - one could make an EP mode taking into account home, team strength etc - different purpose
With no further changes, players (and teams) would no longer be working from the same, average baseline. Every team would be working from a different baseline derived from its own strength.
To visualize this, let's aggregate each offense's old and new EP by season, calculate the difference in those two values, and compare the difference to each offense's strength.
ep_compare <- new_model_check_data %>%
filter(!is.na(ExpPts)) %>%
group_by(Season, posteam) %>%
summarize(
mean_ep = mean(ExpPts),
mean_new_ep = mean(new_ExpPts)
) %>%
left_join(pos_str, by = c('Season', 'posteam')) %>%
mutate(diff_ExpPts = mean_new_ep - mean_ep)
ep_diff <- ggplot(ep_compare, aes(x=pos_epa_str, y=diff_ExpPts)) +
geom_point() +
ggtitle('Change in Offensive Expected Points by Team Strength')
plot(ep_diff)
We can see a pretty clear trend that as an offense gets better, their EP inflates.
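To put a rough number on that trend:
# correlation between a team's offensive strength and its change in average EP
cor(ep_compare$pos_epa_str, ep_compare$diff_ExpPts, use = 'complete.obs')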
So, again, if we use this method without making any other changes, our baseline for each team will no longer be average and we won't be able to use EP in the way that it is often used by analysts. The good news is: there is a simple change we can make to reset the baseline for each team.
But the change is not to the model itself; it is to the dataset that the model will be scoring. The baseline EP values for each team were only different because, in the dataset that the model scored, we told it how good each team was. If we "trick" the model into thinking every team is average, it should put them all back on the same, average baseline again.
Let's try it out, starting with resetting each team's strength to zero.
pbp_ep_model_data_reset <- pbp_ep_model_data %>%
filter(PlayType != 'Field Goal', is.na(FieldGoalResult)) %>%
mutate(
pos_epa_str = 0,
def_epa_str = 0
)
Next, score this dataset with our new EP model and compare with the original EP values.
new_ep_model_pred_reset <- data.frame(predict(new_ep_model, type = 'probs', pbp_ep_model_data_reset)) %>%
mutate(
reset_ExpPts = (-2 * Opp_Safety) + (2 * Safety) + (-3 * Opp_Field_Goal) + (3 * Field_Goal) +
(-7 * Opp_Touchdown) + (7 * Touchdown)
)
pbp_ep_model_data_reset <- pbp_ep_model_data_reset %>%
bind_cols(new_ep_model_pred_reset %>% select(reset_ExpPts))
MLmetrics::R2_Score(pbp_ep_model_data_reset$reset_ExpPts, pbp_ep_model_data_reset$ExpPts)
# [1] 0.9816411
So, our EP values are a lot closer to the original values but a little bit of variance remains. Visualizing what that variance looks like:
reset_ep <- ggplot(pbp_ep_model_data_reset, aes(x=reset_ExpPts, y=ExpPts)) +
geom_bin2d(bins = 100) +
scale_fill_continuous(type = "viridis") +
theme_bw()
plot(reset_ep)
We can also compare the reset predictions to the original predictions by computing each one's RMSE against the eventual points.
MLmetrics::RMSE(
pbp_ep_model_data_reset %>% filter(!is.na(ExpPts)) %>% pull(ExpPts),
pbp_ep_model_data_reset %>% filter(!is.na(ExpPts)) %>% pull(ActPts)
)
# [1] 4.688036
MLmetrics::RMSE(
pbp_ep_model_data_reset %>% filter(!is.na(reset_ExpPts)) %>% pull(reset_ExpPts),
pbp_ep_model_data_reset %>% filter(!is.na(reset_ExpPts)) %>% pull(ActPts)
)
# [1] 4.681047
So, our reset EP actually gets closer to the eventual points.
For one last point of comparison, let's revisit the change in EP versus team strength plot for our reset EP values.
Note: For some reason, there are only 75 plays for the 2016 Jacksonville Jaguars offense in the pbp dataset. Due to their extremely small sample size relative to the other teams (every other team-season has 1000+ plays), their value is an extremely high outlier and skews the graph. As a result, I am going to exclude their value so we can better visualize the result.
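A quick count of plays by team and season in the scoring dataset shows how small that sample is:
# plays per offense per season; the 2016 Jaguars should show up at the top with far fewer than everyone else
pbp_ep_model_data_reset %>%
  count(Season, posteam) %>%
  arrange(n) %>%
  head()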
ep_reset_compare <- pbp_ep_model_data_reset %>%
filter(!is.na(ExpPts)) %>%
group_by(Season, posteam) %>%
summarize(
mean_ep = mean(ExpPts),
mean_reset_ep = mean(reset_ExpPts)
) %>%
left_join(pos_str, by = c('Season', 'posteam')) %>%
mutate(diff_ExpPts = mean_reset_ep - mean_ep)
ep_reset_diff <- ggplot(
ep_reset_compare %>% filter(Season != 2016 | posteam != 'JAC'),
aes(x=pos_epa_str, y=diff_ExpPts)
) +
geom_point() +
ggtitle('Change in Offensive Expected Points by Team Strength - Reset EP')
plot(ep_reset_diff)
Compared to the previous plot of the change in EP, this is virtually random, suggesting that resetting our baseline with this method worked.
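We can also put a number on "virtually random" by repeating the earlier correlation check; it should come out much closer to zero than it did before the reset:
# correlation between offensive strength and the change in EP after the reset,
# excluding the 2016 JAC outlier as in the plot
no_jac <- ep_reset_compare %>% filter(Season != 2016 | posteam != 'JAC')
cor(no_jac$pos_epa_str, no_jac$diff_ExpPts, use = 'complete.obs')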
Given the added predictive value of the team-strength (and HFA) informed model and the success of the technique for resetting each team's baseline, I would recommend that the nflscrapR EP model adopt a similar methodology.