This document discusses the details of combining voter rankings into a final ranked list. (The reader is assumed to be familiar with the TERPECA nomination and voting process.)
These notes are current for the 2023 awards year.
There's no single perfect way to combine diverse preferences into a single ranking (ask any political science professor!). The algorithmic goal is to do a reasonable job of capturing the overall consensus, where there is one, with a bias to lower ranking in case of uncertainty or disagreement. If most people rank X above Y, then X should generally come above Y in the overall ranking.
There are lots of systems for creating an overall ranking from lots of individual comparisons. Notable examples include the Elo rating system used for chess, Instant Runoff Voting used in some elections, Microsoft TrueSkill for Xbox matchmaking, various systems for rating sports teams, and more besides.
Most of these systems work to improve on simple counting metrics (win count, first-place rankings, gold medals, etc.), because counting metrics are heavily skewed by comparison count and "strength of schedule" (which opponents an item is compared with).
TERPECA uses a variant of "Keener's method", named after James Keener, who wrote an influential paper on ranking techniques that used US college football as a motivating example. TERPECA's algorithm is based on what Keener calls the "direct approach".
The first step is to create a sparse preference matrix to capture known pairwise comparisons between the items being compared. The matrix is square, with one row and one column for every item being compared. Cells contain betterness values, non-negative numbers that represent the confidence that the row item is "better than" the column item. For any given pair of items A and B,
- if A is definitely better than B, matrix[A][B] (the cell in A's row, B's column) will be high, matrix[B][A] will be low
- if A is definitely worse than B, matrix[A][B] will be low, matrix[B][A] will be high
- if A is about the same as B, or little or nothing is known, matrix[A][B] and matrix[B][A] are both low or zero
- self-comparisons (cells on the diagonal) are zero by convention
For example, this is a preference matrix for ice cream flavors:
Flavor | vs. 🍫 | vs. 🍓 | vs. 🍦 | vs. 💩 | vs. 🦄 |
---|---|---|---|---|---|
🍫 Chocolate | - | 2 | 2 | 4 | 0 |
🍓 Strawberry | 1 | - | 0 | 5 | 0 |
🍦 Vanilla | 3 | 0 | - | 5 | 0 |
💩 Poop | 1 | 0 | 0 | - | 0 |
🦄 Unicorn Milk | 5 | 5 | 5 | 5 | - |
(Poop is unpopular, everyone loves unicorn milk, and vanilla ~> chocolate ~> strawberry)
The actual betterness metric depends on the domain. For sports, it might be the fraction of games won against that opponent. For ice cream, one might use consumer surveys. For TERPECA, betterness is based on voter rankings, and will be discussed in detail below. Crucially, betterness must be normalized by comparison count, so each row's sum is the average comparative betterness for that row's item -- a value which is still dependent on strength of schedule, but not on comparison count.
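As a rough sketch of that normalization (TypeScript; the names `rawScore` and `comparisons` are illustrative, not from the actual codebase), each raw pairwise score is divided by the number of comparisons the row item participated in:

```typescript
// Keener-style normalization: divide each raw pairwise score by the number of
// comparisons the row item participated in, so a row's sum no longer grows
// just because an item was compared more often.
function normalizeByComparisonCount(
  rawScore: number[][],   // rawScore[i][j]: raw betterness of item i over item j
  comparisons: number[],  // comparisons[i]: total comparisons involving item i
): number[][] {
  return rawScore.map((row, i) =>
    row.map(value => (comparisons[i] > 0 ? value / comparisons[i] : 0)),
  );
}
```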
Then, the eigenvector of the preference matrix corresponding to its largest eigenvalue is found; its entries are schedule-independent "strength scores", one per item, usable for ranking.
For example, these are strength scores for the ice cream matrix above (computed with an online eigenvector calculator):
Flavor | Strength | Remultiplication cross-check |
---|---|---|
🍫 Chocolate | 0.292 | 2*0.158 + 2*0.300 + 4*0.071 = 1.200 ~= 0.292 * 4.1 |
🍓 Strawberry | 0.158 | 1*0.292 + 5*0.071 = 0.647 ~= 0.158 * 4.1 |
🍦 Vanilla | 0.300 | 3*0.292 + 5*0.071 = 1.231 ~= 0.300 * 4.1 |
💩 Poop | 0.071 | 1*0.292 = 0.292 ~= 0.071 * 4.1 |
🦄 Unicorn Milk | 1.000 | 5*0.292 + 5*0.158 + 5*0.300 + 5*0.071 = 4.105 ~= 1.000 * 4.1 |
And these are ice cream flavors ranked by strength score:
Rank | Flavor | Strength |
---|---|---|
#1 🏆 | 🦄 Unicorn Milk | 1.000 |
#2 | 🍦 Vanilla | 0.300 |
#3 | 🍫 Chocolate | 0.292 |
#4 | 🍓 Strawberry | 0.158 |
#5 | 💩 Poop | 0.071 |
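As a cross-check on the example, here is a minimal power-iteration sketch (TypeScript, written for this document rather than taken from the TERPECA codebase) that reproduces the strength scores above:

```typescript
// The ice cream preference matrix from above, rows and columns ordered
// chocolate, strawberry, vanilla, poop, unicorn milk.
const M = [
  [0, 2, 2, 4, 0],
  [1, 0, 0, 5, 0],
  [3, 0, 0, 5, 0],
  [1, 0, 0, 0, 0],
  [5, 5, 5, 5, 0],
];

// Repeatedly multiply a positive start vector by the matrix and rescale so the
// largest entry is 1; the vector converges to the dominant eigenvector, and the
// rescaling factor converges to the dominant eigenvalue (~4.1 here).
function powerIteration(matrix: number[][], iterations = 100): number[] {
  let v = matrix.map(() => 1);
  for (let i = 0; i < iterations; i++) {
    const next = matrix.map(row =>
      row.reduce((sum, cell, j) => sum + cell * v[j], 0),
    );
    const max = Math.max(...next);
    v = next.map(x => x / max);
  }
  return v;
}

console.log(powerIteration(M).map(x => x.toFixed(3)));
// -> [ '0.292', '0.158', '0.300', '0.071', '1.000' ]
```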
It may seem like quite a leap of logic to suddenly apply an abstract concept from linear algebra! But the matrix has been carefully crafted for this to work: that is why the comparison scores are non-negative confidence measures, and why each row is normalized by comparison count. Intuitively, the eigenvector rewards items that are known to be better than other high-scoring items.
By the Perron-Frobenius theorem, the largest eigenvalue is simple and has a unique eigenvector (up to scale) with all-positive entries, provided every betterness value is non-negative and the matrix is "irreducible", meaning there is a path of nonzero comparisons from every item to every other item. (Otherwise there would be multiple "islands" of items with no comparisons between the islands.)
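Irreducibility amounts to the comparison graph being strongly connected; a minimal sketch of such a check (TypeScript, illustrative only):

```typescript
// Treating matrix[i][j] > 0 as a directed edge i -> j, the matrix is
// irreducible iff every item can reach every other item (strong connectivity).
function isIrreducible(matrix: number[][]): boolean {
  const n = matrix.length;
  // Depth-first reachability from item 0, following edges forward or backward.
  const reachesAllFromZero = (forward: boolean): boolean => {
    const seen = new Set<number>([0]);
    const stack = [0];
    while (stack.length > 0) {
      const i = stack.pop()!;
      for (let j = 0; j < n; j++) {
        const edge = forward ? matrix[i][j] : matrix[j][i];
        if (edge > 0 && !seen.has(j)) {
          seen.add(j);
          stack.push(j);
        }
      }
    }
    return seen.size === n;
  };
  // Strongly connected iff item 0 reaches everything and everything reaches item 0.
  return reachesAllFromZero(true) && reachesAllFromZero(false);
}
```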
This technique is not magically perfect, but it does have some good properties:
- it is largely independent of comparison count and strength of schedule
- it captures transitivity (if X > Y and Y > Z, then X > Z)
- it's "smooth"; small betterness changes typically yield small strength changes
- it can be related quantitatively to probability, if the subject can be modeled that way
- it can be readily computed with an iterative approach
Importantly, for subjective ratings like TERPECA, there is no perfect way to determine betterness values; the formula has to be empirically tuned for sensible outcomes. In fact, most of the complexity of TERPECA ranking lies in computing pairwise betterness from voter rankings...
TERPECA uses a multi-factor strategy to calculate the betterness of each X vs. Y comparison:
- tally direct "wins" and "losses" when X appears above or below Y in a list
- tally indirect ("secondary") wins and losses when X beats P and P beats Y directly
- use the Wilson confidence interval midpoint to convert the combined wins and losses into a confidence value
These steps and their wrinkles are described in more detail below.
For every X vs. Y comparison, "wins" (when X is above Y in a voter's ranking) and "losses" (when X is below Y) are counted in both weighted and unweighted totals across all voter rankings:
```mermaid
flowchart LR
subgraph For every voter ranking list
subgraph For each room X on that list
portion("p = 1 / versions ranked") --o roomWinUnw & roomLossUnw
portion --> weight("weight =<br>p / sqrt(list size)") --o roomWin & roomLoss
subgraph For each room Y below X
roomWinUnw("unweightedWins<br>[X vs. Y] += p")
roomWin("weightedWins<br>[X vs. Y] += weight")
end
subgraph For each room Y above X
roomLossUnw("unweightedLosses<br>[X vs. Y] += p")
roomLoss("weightedLosses<br>[X vs. Y] += weight")
end
end
end
```
In the special case where variants of a room were ranked separately but ultimately scored as one room, and the same voter ranked multiple versions in the same list, wins and losses are apportioned fractionally (p in the diagram above) for weighted and unweighted tallies. This is rare.
Inputs to weighted tallies are further divided by the square root of the voter's list size, to soften the effect of voters with long lists.
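In the common case where each room appears only once in a list (no merged variants, so p = 1), the tallying amounts to something like this sketch (TypeScript; names are illustrative, not from the actual codebase):

```typescript
type Tally = Map<string, number>;  // keyed by "X vs. Y"

function addTo(tally: Tally, key: string, amount: number): void {
  tally.set(key, (tally.get(key) ?? 0) + amount);
}

// Tally one voter's ranking (best room first) into the four running totals.
function tallyRanking(
  ranking: string[],
  unweightedWins: Tally, unweightedLosses: Tally,
  weightedWins: Tally, weightedLosses: Tally,
): void {
  const weight = 1 / Math.sqrt(ranking.length);  // soften long lists
  for (let i = 0; i < ranking.length; i++) {
    for (let j = i + 1; j < ranking.length; j++) {
      const winner = ranking[i];  // ranked above
      const loser = ranking[j];   // ranked below
      addTo(unweightedWins, `${winner} vs. ${loser}`, 1);
      addTo(unweightedLosses, `${loser} vs. ${winner}`, 1);
      addTo(weightedWins, `${winner} vs. ${loser}`, weight);
      addTo(weightedLosses, `${loser} vs. ${winner}`, weight);
    }
  }
}
```

Merged room variants would additionally scale each contribution by the fractional p described above.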
For example, if a voter submitted this ranking:
Rank | Room |
---|---|
1 | 😣 Pain Room (hard mode) |
2 | 🦄 Fluffy Unicorns |
3 | 😫 Pain Room (hard with pain) |
4 | 🔒 Escape The Eigenvector |
5 | 😁 Pain Room (easy mode) |
6 | 💩 Poop Ice Cream: The Escape Room Experience |
This voter would have a weight of 1/sqrt(6) ~= 0.408, and if Pain Room variants are merged for scoring, variant-specific wins and losses would be divided by 3, so the following wins and losses would be tallied:
Comparison | Wins | Weighted wins | Losses | Weighted losses | Notes |
---|---|---|---|---|---|
😁😣😫 vs. 🦄 | 0.333 | 0.136 | 0.667 | 0.272 | 1 variant wins, 2 lose |
😁😣😫 vs. 🔒 | 0.667 | 0.272 | 0.333 | 0.136 | 2 variants win, 1 loses |
😁😣😫 vs. 💩 | 1.0 | 0.408 | 0 | 0 | all variants win |
🦄 vs. 😁😣😫 | 0.667 | 0.272 | 0.333 | 0.136 | wins against 2 variants, loses against 1 |
🦄 vs. 🔒 | 1.0 | 0.408 | 0 | 0 | wins |
🦄 vs. 💩 | 1.0 | 0.408 | 0 | 0 | wins |
🔒 vs. 🦄 | 0 | 0 | 1.0 | 0.408 | loses |
🔒 vs. 😁😣😫 | 0.333 | 0.136 | 0.667 | 0.272 | wins against 1 variant, loses against 2 |
🔒 vs. 💩 | 1.0 | 0.408 | 0 | 0 | wins |
💩 vs. 😁😣😫 | 0 | 0 | 1.0 | 0.408 | loses |
💩 vs. 🦄 | 0 | 0 | 1.0 | 0.408 | loses |
💩 vs. 🔒 | 0 | 0 | 1.0 | 0.408 | loses |
(Note the symmetry that the scores for X vs. Y are the win<>loss mirror of Y vs. X.)
Weighted and unweighted wins and losses are tallied across all valid rankings for every pair of rooms, and these values are used in subsequent steps.
For the eigenvector algorithm to work well, preference matrix betterness scores should represent the confidence that one room should be ranked higher than another. For example, 70/100 voters ranking X above Y gives more confidence than 4/5 voters doing so.
For every X vs. Y comparison, TERPECA uses the Wilson confidence interval midpoint at the 95% confidence level (i.e. alpha = 0.05) to convert modified win/loss counts into confidence values:
```mermaid
flowchart LR
subgraph For every room X
subgraph "For every room Y (not X)"
reinflation("reinflation for X vs. Y =<br>(unweighted wins + losses)<br>/ (weighted wins + losses)")
reinflation --> reinflatedWins & reinflatedLosses
weightedWins("weightedWins") --> reinflatedWins
weightedLosses("weightedLosses") --> reinflatedLosses
reinflatedWins((×)) --> directWilson
reinflatedLosses((×)) --> directWilson
directWilson((Wilson<br>conf.)) --> directConfidence("directConfidence<br>[X vs. Y]")
end
end
```
The weighted wins and losses from earlier are "reinflated" by the inverse of the average weighting before the Wilson confidence midpoint is computed. Together, the weighting and "reinflation" mean that while the influence of "supervoters" on the ranking of a room is reduced, the actual number of comparisons made is still taken into account when determining confidence.
For example, if rooms had these wins and losses (symmetric cases elided):
Comparison | Wins | Weighted wins | Losses | Weighted losses |
---|---|---|---|---|
Room A vs. Room B | 70 | 30 | 30 | 20 |
Room C vs. Room D | 4 | 3 | 8 | 6 |
Room E vs. Room F | 1 | 1 | 0 | 0 |
The computed values would be as follows:
Comparison | Reinflation factor | Reinflated wins | Reinflated losses | Wilson score |
---|---|---|---|---|
Room A vs. Room B | (70+30)/(30+20)=2 | 60 | 40 | 0.596 |
Room C vs. Room D | (4+8)/(3+6)≈1.33 | 4 | 8 | 0.374 |
Room E vs. Room F | (1+0)/(1+0)=1 | 1 | 0 | 0.603 |
Note again that TERPECA uses the Wilson confidence interval midpoint, not the more commonly used lower bound! At the 95% level, the midpoint is roughly equivalent to always adding two wins and two losses, then taking the simple win fraction.
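For concreteness, a minimal sketch of the reinflation and midpoint computation (TypeScript; illustrative only, not the production code):

```typescript
// Midpoint of the Wilson score interval for `wins` successes out of
// `wins + losses` trials; z = 1.96 corresponds to a 95% interval.
function wilsonMidpoint(wins: number, losses: number, z = 1.96): number {
  const n = wins + losses;
  if (n === 0) return 0;  // no comparisons at all: no confidence either way
  const pHat = wins / n;
  const z2 = z * z;
  return (pHat + z2 / (2 * n)) / (1 + z2 / n);
}

// Reinflate the weighted tallies back up to the unweighted comparison count,
// then take the Wilson midpoint, as in the table above.
function directConfidence(
  weightedWins: number, weightedLosses: number,
  unweightedWins: number, unweightedLosses: number,
): number {
  const reinflation =
    (unweightedWins + unweightedLosses) / (weightedWins + weightedLosses);
  return wilsonMidpoint(weightedWins * reinflation, weightedLosses * reinflation);
}

console.log(directConfidence(30, 20, 70, 30).toFixed(3));  // "0.596"
console.log(wilsonMidpoint(1, 0).toFixed(3));              // "0.603"
```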
This "directConfidence" value would be suitable for a preference matrix, but TERPECA has more factors to consider...
The eigenvector method does infer transitive relations (if A beats B, and B beats C, then A beats C), but TERPECA adds explicit weighting for transitive relations as well, which empirically improves ranking quality given the sparse nature of direct comparisons. This may be the most subtle aspect of the ranking algorithm.
```mermaid
flowchart LR
subgraph For every room X
subgraph "For every room Y (not X)"
subgraph "For every room P (not X, Y)"
pivotFilter("valid pivot?") --> pivotWinConfidence & pivotLossConfidence
pivotWinConfidence("confidence<br>for X > P > Y")
pivotLossConfidence("confidence<br>for X < P < Y")
end
pivotWinConfidence --o indirectWinConfidence("indirectWinConfidence<br>for X > (all P) > Y")
pivotLossConfidence --o indirectLossConfidence("indirectLossConfidence<br>for X < (all P) < Y")
indirectWinConfidence & indirectLossConfidence --> reverseConfidence
reverseConfidence((inv.<br>Wilson)) --> indirectWins & indirectLosses
indirectWins("indirectWins<br>[X vs. Y]")
indirectLosses("indirectLosses<br>[X vs. Y]")
end
end
```
For every possible X vs. Y comparison (even when no direct comparisons exist), all other rooms are examined for suitability as tertiary "pivots". A room P is a valid pivot if:
- P is neither X nor Y
- direct comparisons are available between X and P and also between Y and P
- X beats P (directConfidence[X vs. P] > 0.5) and P doesn't lose to Y (directConfidence[Y vs. P] < 0.5)
- or the other way around (Y beats P and P doesn't lose to X)
For valid pivots, a "win confidence" (confidence in X > P > Y) and a "loss confidence" (confidence in X < P < Y) are computed as if by this pseudocode, where winConfidence[A vs. B] refers to the directConfidence value computed earlier:
```
pivotStrength =
  min(weightedWins[X vs. P], weightedLosses[Y vs. P]) +
  min(weightedLosses[X vs. P], weightedWins[Y vs. P])
scale =
  min(pivotStrength, sqrt(pivotStrength)) /  // to reduce the impact of any one pivot
  (winConfidence[X vs. P] + winConfidence[Y vs. P])
indirectWinConfidence[X vs. Y] += scale * winConfidence[X vs. P]
indirectLossConfidence[X vs. Y] += scale * winConfidence[Y vs. P]
```
Finally, these confidence values (derived from winConfidence values, which are Wilson confidence midpoints) are put through an inverse of the Wilson confidence midpoint computation to produce synthetic "indirectWins[X vs. Y]" and "indirectLosses[X vs. Y]" values (the "secondary" wins and losses mentioned above).
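Algebraically, the inverse step just solves the midpoint formula for the win count; a sketch (TypeScript; how the actual code chooses the synthetic totals is not spelled out here):

```typescript
// Given a target Wilson midpoint `confidence` and a total comparison count
// `total`, solve midpoint = (wins + z^2 / 2) / (total + z^2) for wins.
function inverseWilsonMidpoint(confidence: number, total: number, z = 1.96): number {
  const z2 = z * z;
  return confidence * (total + z2) - z2 / 2;
}
```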
If this all seems a little arbitrary and magical, it is largely driven by empirical observations over the years!
As a last step, for every X vs. Y comparison, wins and losses from direct comparisons (weighted) and secondary comparisons (synthetic, as above) are added into total win/loss scores for another Wilson confidence evaluation. (Note that this does not use the directConfidence value for the pair, which only feeds into the indirect comparisons.)
```mermaid
flowchart LR
subgraph For every room X
subgraph "For every room Y (not X)"
directWins("weightedWins") --> allWinsAdd
indirectWins("indirectWins") --> allWinsAdd
allWinsAdd((#plus;)) --> finalWilson
directLosses("weightedLosses") --> allLossesAdd
indirectLosses("indirectLosses") --> allLossesAdd
allLossesAdd((#plus;)) --> finalWilson
finalWilson((Wilson<br>conf.)) --> finalConfidence("matrix[X vs. Y]")
end
end
```
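A sketch of this last combination step (TypeScript; illustrative only), which yields the entry for the preference matrix described at the start of this document:

```typescript
// Combine direct (weighted) and secondary (synthetic) tallies for one pair of
// rooms, then take the Wilson confidence midpoint to get the matrix entry.
function matrixEntry(
  weightedWins: number, weightedLosses: number,
  indirectWins: number, indirectLosses: number,
  z = 1.96,
): number {
  const wins = weightedWins + indirectWins;
  const losses = weightedLosses + indirectLosses;
  const n = wins + losses;
  if (n === 0) return 0;
  const z2 = z * z;
  return (wins / n + z2 / (2 * n)) / (1 + z2 / n);
}
```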
Once the pairwise betterness values are computed as above, the power iteration method is used to find the eigenvector with the largest eigenvalue. As discussed above, this eigenvector is directly usable for sorting the final ranked list.
A variety of reports and diagnostics are available from this process, which are outside the scope of this document.
Literature on ranking systems
- Keener, J. The Perron-Frobenius Theorem and the Ranking of Football Teams
- Newman, M.E.J. Efficient Computation of Rankings from Pairwise Comparisons
- Langville, A. & Meyer, C. Who's #1? The Science of Rating and Ranking
- Miller, E. How Not To Sort By Average Rating
- Vang, J. Data Science Topics, the "Rating and Ranking" section
- Wikipedia, Binomial proportion confidence interval, the "Wilson confidence interval" section
- Wikipedia, Bradley-Terry model
- Wikipedia, Sports Rating System
- Wikipedia, Pairwise Comparison (Psychology)
Code in voter-portal/.../analyze.component.ts (only visible to project members).