Have a moment to talk about our lord and savior Moneyball (2011)?
A few months ago, I was playing with prolog which got me thinking about possible applications for constraint solvers. Prolog offers a straightforward syntax to express constraints. This led me to explore constraint solving techniques in the context of maximizing points from players for a hockey pool.
With a score of 2502, the strategy detailed here would rank us in position #74 out of 16,733 of La Presse's 2022-2023 hockey pool (no affiliation!), ranking around the 99.56th percentile. Not too bad!
Using a variant of the knapsack problem, we build a team that respects our salary cap, while maximizing the value that players bring. We use players' salary, their scored points, and game logs across seasons to backtest our strategy.
Let's go through the steps necessary to build said team.
First, we must define the problem's constraints.
La Presse's hockey pool (which we will be using) uses these constraints:
- The salary cap is $82.5m
- The team must be composed of 2 goalies, 6 defenders and 12 forwards
The sum of the players' cap hit can never be above the salary cap. That the team must always respect the player type composition.
Points are scored as follows:
- Goalies get 3 points per win, 5 points per shutout, 5 points per goal and 2 points per pass
- Defenders get 3 points per goal and 2 points per pass
- Forwards get 2 points per goal and 1 point per pass
The pool allows for up to 5 exchanges per month, for every month of the season. The regular NHL 2022-2023 season started on October 7th 2022 and ended on April 13th 2023. This means we get 5 potential exchanges for each of the 7 months of the season.
Now that we've determined exactly what data we need, let's gather it.
I wasn't able to find a single source of data that contained player's statistics and cap hits throughout seasons, so I joined multiple data sources together.
For player stats, Hockey Reference offers downloadable CSV files. Basic, advanced and miscellaneous statistics for players are downloadable in different files. We download each file, and parse them to dataframes. Then, we join the dataframes together using pandas' merge function. This then becomes a single dataframe with rows representing players and columns representing their statistics.
We load the CSV with this function:
def _load_to_df(self, player_type, stats_type, columns, dtype=None):
return (
pd.read_csv(
f"data/hockey_reference/{self.season}_{player_type}_{stats_type}.csv",
header=1,
usecols=list(columns.keys()) + [ID_COLUMN],
)
.rename(columns=columns)
.astype(dtype if dtype is not None else {})
# Per-player totals are in the first row. Following rows for same player are
# stats per-player, per-team.
.drop_duplicates(subset=ID_COLUMN, keep="first")
)
And join the dataframes together with this code, into a dataframe df
:
basic = self._load_to_df(
"skaters",
"basic",
columns={
"Pos": "position",
"G": "goals",
"A": "assists",
"GP": "games_played",
"Player": "name",
},
)
basic = basic[basic["position"].isin(player_positions[player_type])]
advanced = self._load_to_df("skaters", "advanced", columns={"Pos": "position"})
advanced = advanced[advanced["position"].isin(player_positions[player_type])]
df = pd.merge(basic, advanced, how="left", on=ID_COLUMN, suffixes=("", "_y"))
df = df.drop(df.filter(regex="_y$").columns, axis=1)
df = df.merge(misc, how="left", on=ID_COLUMN, suffixes=("", "_y"))
df = df.drop(df.filter(regex="_y$").columns, axis=1)
For statistics across games, we also use Hockey Reference as a data source.
Regular games for each season are listed in a nicely structured html table on https://www.hockey-reference.com/leagues/NHL_{season}_games.html
(the 2022 season, for example). For every row in the table, we visit the corresponding game's page, which contains a bunch of tables with very detailed statistics for the game. For our purpose, the only data points we want are the scoring summary table, and the goalie statistics table. In these two tables, we have data that allows us to track player statistics across games. This is key data for backtesting player trades as the season progresses. For players, that's the points they score and their assists. For goalies, that's the decision (loss, win, or win in overtime), their total saves, their saves percentage, and if the game ended in a shutout.
For player salaries, I reverse-engineered cap-friendly's API to query for salaries by season:
def build_url(self, player_type, page):
return (
f"https://www.capfriendly.com/ajax/browse/active/{self.season}/caphit/all/{player_type}"
f"?stats-season={self.season}"
"&hide=team,clauses,age,position,handed,expiry-status,salary"
f"&pg={page}"
)
A simple GET request to the URL with the right parameters provides us with a list of players with their respective salaries. This is saved as a CSV for easy loading in a dataframe at a later time as well.
Once we've crawled everything, the data must be massaged into a state usable by the solver.
Players as listed by Hockey Reference and CapFriendly are not referred to by the same name. To fix this issue, I manually built an equivalence map between names from both sources for 44 mismatched players:
corrections = {
"Mitchell Marner": "Mitch Marner",
"Evgeni Dadonov": "Evgenii Dadonov",
"Maxime Comtois": "Max Comtois",
"Zachary Sanford": "Zach Sanford",
"Samuel Blais": "Sammy Blais",
"Nicholas Paul": "Nick Paul",
"Patrick Maroon": "Pat Maroon",
"Joseph Veleno": "Joe Veleno",
"Matthew Boldy": "Matt Boldy",
"Nick Abruzzese": "Nicholas Abruzzese",
# ... 34 more lines
Some goalies show up one season, but not another. Between the 2021-2022 and 2022-2023 seasons, we have three instances of this issue: Thomas Hodges, Matthew Berlin and Jett Alexander. All three are either emergency goaltenders, or one-time goaltenders which were brought during one of the two seasons. We just filter them out from our list of players to use in the solver.
Finally, some players are considered as defenders by one source, and forwards by another. Mason Geertsen, Hunter Drew, Luke Witkowski show up with this issue. As with the one-time goalies, they're filtered out.
After this little bit of housekeeping, we have a clean enough source of player stats across games and seasons! 🎉 Let's build the solver!
To encode the constraints as code, we use Google's OR-tools, a suite of tools to solve optimization problems. It's quite well documented and has a bunch of examples on GitHub.
More precisely, we use the CP-SAT solver, which is used to solve integer programming problems. We define the variables that represent our data, constrain them, and then solve for the best picks.
First, let's declare the list of players as variables:
x = {}
for j, players in enumerate(players_count):
for i in range(players):
x[i, j] = self.model.NewBoolVar(f'x_{i}_{j}')
players_count
contains the number of players for each group (forwards, defenders, goalies). The inner loop creates a boolean variable (NewBoolVar
) for each player, which represents if it's picked (when set to 1
) or isn't picked (when set to 0
). We store that variable in x[i, j]
where i
is the player index in its group j
. self.model
is an instance of the CP-SAT model we use for the solver.
Next, we set the desired size for each group:
for j, (players, count) in enumerate(zip(players_count, PICKS_COUNT)):
self.model.Add(
cp_model.LinearExpr.Sum([x[i, j] for i in range(players)]) == count
)
The count of players in each group must match the expected count: we want exactly 12 forwards, 6 defenders, and 2 goalies. We use a cp_model.LinearExpr.Sum
instance because it's more efficient over large lists, but we could use the native sum
python function as well since we have a small dataset.
Finally, we restrain the total salary cap of our picked players to the maximum salary cap:
self.model.Add(
cp_model.LinearExpr.Sum(
[
x[i, j] * pool[j][i]["weight"]
for j in range(len(players_count))
for i in range(players_count[j])
]
)
<= salary_cap
)
A player's (x[i, j]
) weight (pool[j][i]["weight"]
) is its cap hit for the season. This is similar to a constraint used in a knapsack multiple bins problem where we try to fit as many items with as much value as possible into as few bins as possible.
When selecting our initial pool, these are all the constraints we need. However, when trying to find potential exchanges as the season progresses, or when backtesting our strategy, we must constrain the potential exchanges:
picks = []
for i, group in enumerate(solution.pick_indices):
picks += [x[j, i] for j in group]
self.model.Add(
cp_model.LinearExpr.Sum(picks)
>= cp_model.LinearExpr.Sum(PICKS_COUNT) - trades_count
)
The count of players that we keep must be greater than or equal than the team size minus the amount of possible trades for the month. In other words, we can't trade more than 5 players per month.
Now that the problem constraints are defined, we declare the solver's objective: maximizing the total value of our picks.
expressions = []
coefficients = []
for j, players in enumerate(pool):
for i in range(len(players)):
expressions.append(x[i, j])
coefficients.append(players[i]["value"])
self.model.Maximize(cp_model.LinearExpr.WeightedSum(expressions, coefficients))
Players are expressions, and their corresponding values are the coefficients. We maximize the sum of the selected players' value.
This brings us to an important question when selecting players: how do we quantify a player's value? Since the focus is on building a basic selection and trading algorithm, we use the same value strategy as La Presse. The value strategy options are pretty much infinite. One could delve into statistical analysis, machine learning, algorithmic trading strategies used on stocks or even portfolio rebalancing strategies to create a value strategy with an edge. I go over other simple value strategies we could use in the "Better value strategy" section.
For weighing players, we use their cap hit as it's what La Presse uses to weigh players as well, and because we have a salary cap we can't bust. This is a direct translation of the knapsack problem weight constraint.
As code, it looks like this:
class LaPresseValueStrategy(ValueStrategy):
def forward(self, player):
return (player['goals'] * 2) + player['assists']
def defender(self, player):
return (player['goals'] * 3) + (player['assists'] * 2)
def goalie(self, player):
return (player['wins'] * 3) + (player['shutouts'] * 5) + (player['goals'] * 5) + (player['assists'] * 2)
Somewhat straightforward.
We have now collected the data required for backtesting, and built the solver to pick an initial alignment and trades as the season progresses. We're ready to backtest.
To confirm our solver works, and that our player value strategy is sufficiently good, we need to backtest our strategy.
We will pick players according to the 2021-2022 season, and then backtest the trading solver as the season progresses on the 2022-2023 season. I've mentioned this earlier, but to improve the value strategy, we should look at more than a single season to determine our initial picks. A player's performance in a single season is not the best representation of their potential performance in the next season.
The first step is to run the solver to pick an initial alignment. We feed player statistics for the regular 2021-2022 hockey season to the solver, with a few specific particularities:
- We assign players the cap hits of the next season (2022-2023).
- We limit the cap hit of the season to the one of the next season.
With the first adjustment, we prevent starting the season with a cap hit that's over the next season's salary cap. With the second adjustment, we maximize the total salary space that we have for picking players in the next season. As the salary cap went from $82.5m to $83.5m from the 2021-2022 season to the 2022-2023 season, we're getting an additional $1m to use for picking players.
In other words, we're picking players for the 2022-2023 season, while looking at their salary cap figures from the 2021-2022 season.
With our player alignment chosen for the start of the season, we can simulate the season progress. Every month, we sum up the total amount of points that we gathered from our alignment. Then, we call the solver with our current alignment, and specify through a constraint that we're allowed up to 5 trades.
As each month progresses, we restrict player statistics fed to the solver to a window of the past month plus a week (in other words, we look at the past ~5 weeks). This duration is most definitely a local maximum of data across the two seasons I tested on, and I expect we'd find a more accurate one by testing across more seasons. Still, this window's duration seems to offer pertinent insights into a player's recent performance, which aligns with our objective of selecting players that will bring us points.
Restricting the window the solver looks at allows the solver to decide which trades to do based on a player's recent performance, which should be a better indicator of a player's future performance than its statistics since the start of the year. This also makes our modified knapsack algorithm pluck out players who get injured, get benched, or overall make less points, since their value for the window is lower.
When summing up points for past months, we use a different statistics window than the one we use to evaluate players through the solver. The statistics windows for counting points must not overlap with each other, to prevent double-counting points. Since we trade players at the end of each month, we count points based on the past month's statistics window.
The progress function in the season simulator is in charge of this trading logic.
With our solver having the trading logic implemented and backtested, we run it!
We start the year with this team alignment:
name position games_played goals assists normalized_name cap_hit value weight wins saves saves_percent shutouts
0 Elias Lindholm C 82 42 40 eliaslindholm 4850000 124 4850000 NaN NaN NaN NaN
1 Jonathan Huberdeau LW 80 30 85 jonathanhuberdeau 5900000 145 5900000 NaN NaN NaN NaN
2 Jordan Kyrou C 74 27 48 jordankyrou 2800000 102 2800000 NaN NaN NaN NaN
3 J.T. Miller C 80 32 67 jtmiller 5250000 131 5250000 NaN NaN NaN NaN
4 Leon Draisaitl C 80 55 55 leondraisaitl 8500000 165 8500000 NaN NaN NaN NaN
5 Michael Bunting LW 79 23 40 michaelbunting 950000 86 950000 NaN NaN NaN NaN
6 Roope Hintz LW 80 37 35 roopehintz 3150000 109 3150000 NaN NaN NaN NaN
7 Ryan Hartman RW 82 34 31 ryanhartman 1700000 99 1700000 NaN NaN NaN NaN
8 Tage Thompson C 78 38 30 tagethompson 1400000 106 1400000 NaN NaN NaN NaN
9 Tim Stützle LW 79 22 36 timstutzle 925000 80 925000 NaN NaN NaN NaN
10 Trevor Zegras C 75 23 38 trevorzegras 925000 84 925000 NaN NaN NaN NaN
11 Troy Terry C 75 37 30 troyterry 1450000 104 1450000 NaN NaN NaN NaN
12 Cale Makar D 77 28 58 calemakar 9000000 200 9000000 NaN NaN NaN NaN
13 Evan Bouchard D 81 12 31 evanbouchard 863333 98 863333 NaN NaN NaN NaN
14 Kris Letang D 78 10 58 krisletang 7250000 146 6100000 NaN NaN NaN NaN
15 Moritz Seider D 82 7 43 moritzseider 863333 107 863333 NaN NaN NaN NaN
16 Roman Josi D 80 23 73 romanjosi 9059000 215 9059000 NaN NaN NaN NaN
17 Victor Hedman D 82 20 65 victorhedman 7875000 190 7875000 NaN NaN NaN NaN
18 Frederik Andersen G 52 0 4 frederikandersen 4500000 133 4500000 35.0 1320.0 0.922 4.0
19 Jacob Markström G 63 0 3 jacobmarkstrom 6000000 162 6000000 37.0 1617.0 0.922 9.0
Then, the solver executes a few trades:
Looking from September 2022 24 to October 2022 31
October 2022: 212
- Elias Lindholm 4850000 124
- Jonathan Huberdeau 5900000 145
- Kris Letang 6100000 146
- Moritz Seider 863333 107
- Roman Josi 9059000 215
+ David Pastrňák 6666667 24
+ Gabriel Vilardi 825000 21
+ Erik Karlsson 11500000 28
+ Rasmus Dahlin 6000000 30
+ Sebastian Aho 825000 24
Looking from October 2022 25 to November 2022 30
November 2022: 412
- Ryan Hartman 1700000 6
- Evan Bouchard 863333 8
- Victor Hedman 7875000 17
- Frederik Andersen 4500000 12
- Jacob Markström 6000000 12
+ Jason Robertson 7750000 46
+ Brandon Montour 3500000 38
+ Josh Morrissey 6250000 45
+ Logan Thompson 766667 32
+ Vítek Vaněček 3400000 38
Looking from November 2022 24 to December 2022 31
December 2022: 427
- J.T. Miller 5250000 26
- Trevor Zegras 925000 21
- Troy Terry 1450000 24
- Cale Makar 9000000 36
- Vítek Vaněček 3400000 38
+ Connor McDavid 12500000 53
+ Dylan Cozens 894167 26
+ Ryan Nugent-Hopkins 5125000 33
+ Erik Gustafsson 800000 36
+ Pyotr Kochetkov 842500 34
Looking from December 2022 25 to January 2023 31
January 2023: 439
- Jason Robertson 7750000 32
- Roope Hintz 3150000 29
- Josh Morrissey 6250000 49
- Logan Thompson 766667 18
- Pyotr Kochetkov 842500 34
+ Jack Hughes 8000000 40
+ Trevor Zegras 925000 25
+ Vince Dunn 4000000 41
+ Jake Oettinger 4000000 31
+ Martin Jones 2000000 37
Looking from January 2023 25 to February 2023 28
February 2023: 308
- David Pastrňák 6666667 38
- Gabriel Vilardi 825000 12
- Rasmus Dahlin 6000000 40
- Jake Oettinger 4000000 31
- Martin Jones 2000000 37
+ Claude Giroux 6500000 30
+ Dawson Mercer 894167 20
+ Josh Morrissey 6250000 29
+ Alexandar Georgiev 3400000 26
+ Ilya Samsonov 1800000 28
Looking from February 2023 22 to March 2023 31
March 2023: 455
- Jack Hughes 8000000 20
- Michael Bunting 950000 14
- Erik Gustafsson 800000 22
- Erik Karlsson 11500000 30
- Ilya Samsonov 1800000 28
+ Clayton Keller 7150000 42
+ Nathan MacKinnon 6300000 41
+ Evan Bouchard 863333 29
+ Miro Heiskanen 8450000 52
+ Stuart Skinner 750000 38
Looking from Mar 25, 2023 to Apr 30, 2023
April 2023: 249
And we end the year with 2502 points, with this team:
Total value: 2502 (124 to go)
name position games_played goals assists normalized_name cap_hit value weight wins saves saves_percent shutouts
0 Claude Giroux C 82 6 13 claudegiroux 6500000 25 6500000 NaN NaN NaN NaN
1 Clayton Keller C 82 13 16 claytonkeller 7150000 42 7150000 NaN NaN NaN NaN
2 Connor McDavid C 82 17 22 connormcdavid 12500000 56 12500000 NaN NaN NaN NaN
3 Dawson Mercer C 82 7 9 dawsonmercer 894167 23 894167 NaN NaN NaN NaN
4 Dylan Cozens C 81 8 6 dylancozens 894167 22 894167 NaN NaN NaN NaN
5 Jordan Kyrou C 79 9 6 jordankyrou 2800000 24 2800000 NaN NaN NaN NaN
6 Leon Draisaitl C 80 13 19 leondraisaitl 8500000 45 8500000 NaN NaN NaN NaN
7 Nathan MacKinnon C 71 13 15 nathanmackinnon 6300000 41 6300000 NaN NaN NaN NaN
8 Ryan Nugent-Hopkins C 82 8 18 ryannugenthopkins 5125000 34 5125000 NaN NaN NaN NaN
9 Tage Thompson C 78 8 9 tagethompson 1400000 25 1400000 NaN NaN NaN NaN
10 Tim Stützle LW 78 11 14 timstutzle 925000 36 925000 NaN NaN NaN NaN
11 Trevor Zegras C 81 3 8 trevorzegras 925000 14 925000 NaN NaN NaN NaN
12 Brandon Montour D 80 4 12 brandonmontour 3500000 36 3500000 NaN NaN NaN NaN
13 Evan Bouchard D 82 3 10 evanbouchard 863333 29 863333 NaN NaN NaN NaN
14 Josh Morrissey D 78 4 7 joshmorrissey 6250000 26 6250000 NaN NaN NaN NaN
15 Miro Heiskanen D 79 4 20 miroheiskanen 8450000 52 8450000 NaN NaN NaN NaN
16 Sebastian Aho D 71 9 9 sebastianaho 825000 45 825000 NaN NaN NaN NaN
17 Vince Dunn D 81 4 17 vincedunn 4000000 46 4000000 NaN NaN NaN NaN
18 Alexandar Georgiev G 16 0 0 alexandargeorgiev 3400000 48 3400000 11.0 1748.0 0.918562 3.0
19 Stuart Skinner G 15 0 0 stuartskinner 750000 38 750000 11.0 1403.0 0.897667 1.0
With a naive approach for evaluating and trading players, we end up with quite a good amount of points!
Even though this constraint solver works fine and ranks ok, many changes, both big and small, could make the hockey pool picker faster, more efficient, and more reliable.
I only tested with the past 2 seasons, as the process for sourcing and manually cleaning up data was lengthy. Testing across more seasons could allow for better backtesting. This would allow fine-tuning a value strategy. Furthermore, only regular season games data was sourced for reasons of simplicity. To get a better representation of a player's performance across an entire season, we could source playoff data as well.
I do all trades at the end of the month, not taking into consideration injuries. I tried taking into consideration injuries as they occurred, but finding a history of injuries for past seasons didn't seem possible without paying for some API access. This is the next thing I want to add to the trading simulator, as it's one of the most obvious opportunities for improvement.
Instead of picking the value of a player according to the La Presse strategy, one could develop their own. The La Presse value-attributing strategy gives goalies 5 points if they score a goal, which seems a bit like gambling to me.
I've coded up some strategies which aren't currently used by the solver, since they ended up giving worse results than the La Presse Strategy:
- La Presse minus gambling strategy, where we remove the value given to a goalie from goals. I'm not sure why this didn't prove to be a better value strategy than the current one. I'd expect it to perform better across multiple seasons.
- La Presse with doubled points for defenders strategy, where the value attributed to defenders is doubled, in an attempt to take advantage of the fact that defenders get 3 and 2 points for goals and assists versus 2 and 1 points for forwards. Increasing points by 100% might be too much for defenders. Maybe lowering the value multiplication factor would lead to it becoming a better strategy.
- Moneyball strategy, where a forward and defender's value is determined by assists only. For goalies, we use the save percentage. I suspect in a context where we'd build a hockey team with players that actually play together, giving a lot of value to players who pass often would increase the opportunity of scoring, and would lead to more goals.
- La Presse with percentage strategy, where a forward and a defender's value is the percentage of theoretical points it could make (naively set to 250) in a season, and a goalie its save percentage. Giving more thought into the idea of "theoretical points" a player can score could lead to this becoming a better strategy.
One could use the popular Corsi/Fenwick statistic on a team-level, and factor that in the calculation for a player value (a player on a winning team probably has a better chance of gathering points than one on a losing team). But then again, that might not suit you, because it does not reflect puck possession, which is usually a good indicator of team performance.
When picking trades, instead of looking back at the previous month plus 1 week, we could look back at the entire season, and exponentially give more value to data as it comes closer in time to the present. Intuitively, a variant of an exponential growth function could be a good candidate for this.
Since the dataset of players is relatively small (~1300 players), the speed of non-vectorized operations is tolerable. However, to improve the speed and memory efficiency of calculating points per-month, per-player, we could vectorize these operations. This would allow backtesting the strategy across many seasons faster, whenever we implement that.
Along the same line, data is loaded from csv files to dataframes on every run. We should save files in the parquet format to make the startup time faster.
When picking trades, we could list past trades which ended up being bad ones. The current algorithm picks a goalie with a single game played at some point. This probably shouldn't happen, unless the upside of having a bad goalie is outweighed by having excellent other players. I haven't looked into why exactly this trade happens yet.
We developed a solver that achieves a respectable ranking by using a constraint solver with a modified version of the knapsack algorithm. Then, we went through the pain of gathering, joining and cleaning data from different sources. Finally, we backtested the trading algorithm across the past season.
Unfortunately, our current score falls short of the top 5 positions which could've won us an iPad mini! Fortunately, we have a bunch of potential improvements we can apply to squeeze in a few more points. With the 2023-2024 season quickly approaching, we have strong motivation to do so. Let's see what it brings!
The full code is available at https://github.com/Spuffynism/hockey-pool-picker