This document discusses the details of combining voter rankings into a final ranked list. (The reader is assumed to be familiar with the TERPECA nomination and voting process.)
These notes are current for the 2023 awards year.
There's no single perfect way to combine diverse preferences into a single ranking (ask any political science professor!). The algorithmic goal is to do a reasonable job of capturing the overall consensus, where there is one, with a bias to lower ranking in case of uncertainty or disagreement. If most people rank X above Y, then X should generally come above Y in the overall ranking.
There are lots of systems for creating an overall ranking from lots of individual comparisons. Notable examples include the Elo rating system used for chess, Instant Runoff Voting used in some elections, Microsoft TrueSkill for Xbox matchmaking, various systems for rating sports teams, and more besides.
Most of these systems work to improve on simple counting metrics (win count, first-place rankings, gold medals, etc.), because counting metrics are heavily skewed by comparison count and "strength of schedule" (which opponents an item is compared with).
TERPECA uses a variant of "Keener's method", named after James Keener, who wrote an influential paper on ranking techniques that used US college football as a motivating example. TERPECA's algorithm is based on what Keener calls the "direct approach".
The first step is to create a sparse preference matrix to capture known pairwise comparisons between the items being compared. The matrix is square, with one row and one column for every item being compared. Cells contain betterness values, non-negative numbers that represent the confidence that the row item is "better than" the column item. For any given pair of items A and B,
- if A is definitely better than B, matrix[A][B] (the cell in A's row, B's column) will be high, matrix[B][A] will be low
- if A is definitely worse than B, matrix[A][B] will be low, matrix[B][A] will be high
- if A is about the same as B, or little or nothing is known, matrix[A][B] and matrix[B][A] are both low or zero
- self-comparisons (cells on the diagonal) are zero by convention
For example, this is a preference matrix for ice cream flavors:
Flavor | vs. 🍫 | vs. 🍓 | vs. 🍦 | vs. 💩 | vs. 🦄 |
---|---|---|---|---|---|
🍫 Chocolate | - | 2 | 2 | 4 | 0 |
🍓 Strawberry | 1 | - | 0 | 5 | 0 |
🍦 Vanilla | 3 | 0 | - | 5 | 0 |
💩 Poop | 1 | 0 | 0 | - | 0 |
🦄 Unicorn Milk | 5 | 5 | 5 | 5 | - |
(Poop is unpopular, everyone loves unicorn milk, and vanilla ~> chocolate ~> strawberry)
The actual betterness metric depends on the domain. For sports, it might be the fraction of games won against that opponent. For ice cream, one might use consumer surveys. For TERPECA, betterness is based on voter rankings, and will be discussed in detail below. Crucially, betterness must be normalized by comparison count, so each row's sum is the average comparative betterness for that row's item -- a value which is still dependent on strength of schedule, but not on comparison count.
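As a rough sketch of that normalization (TypeScript; the names `rawScore` and `comparisons` are illustrative, not from the actual codebase), each raw pairwise score is divided by the number of comparisons the row item participated in:

```typescript
// Keener-style normalization: divide each raw pairwise score by the number of
// comparisons the row item participated in, so a row's sum no longer grows
// just because an item was compared more often.
function normalizeByComparisonCount(
  rawScore: number[][],   // rawScore[i][j]: raw betterness of item i over item j
  comparisons: number[],  // comparisons[i]: total comparisons involving item i
): number[][] {
  return rawScore.map((row, i) =>
    row.map(value => (comparisons[i] > 0 ? value / comparisons[i] : 0)),
  );
}
```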
Then, the eigenvector of the preference matrix corresponding to its largest eigenvalue is found; its entries are schedule-independent "strength scores", one per item, usable for ranking.
For example, these are strength scores for the ice cream matrix above (computed with an online eigenvector calculator):
Flavor | Strength | Remultiplication cross-check |
---|---|---|
🍫 Chocolate | 0.292 | 2*0.158 + 2*0.300 + 4*0.071 = 1.200 ~= 0.292 * 4.1 |
🍓 Strawberry | 0.158 | 1*0.292 + 5*0.071 = 0.647 ~= 0.158 * 4.1 |
🍦 Vanilla | 0.300 | 3*0.292 + 5*0.071 = 1.231 ~= 0.300 * 4.1 |
💩 Poop | 0.071 | 1*0.292 = 0.292 ~= 0.071 * 4.1 |
🦄 Unicorn Milk | 1.000 | 5*0.292 + 5*0.158 + 5*0.300 + 5*0.071 = 4.105 ~= 1.000 * 4.1 |
And these are ice cream flavors ranked by strength score:
Rank | Flavor | Strength |
---|---|---|
#1 🏆 | 🦄 Unicorn Milk | 1.000 |
#2 | 🍦 Vanilla | 0.300 |
#3 | 🍫 Chocolate | 0.292 |
#4 | 🍓 Strawberry | 0.158 |
#5 | 💩 Poop | 0.071 |
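As a cross-check on the example, here is a minimal power-iteration sketch (TypeScript, written for this document rather than taken from the TERPECA codebase) that reproduces the strength scores above:

```typescript
// The ice cream preference matrix from above, rows and columns ordered
// chocolate, strawberry, vanilla, poop, unicorn milk.
const M = [
  [0, 2, 2, 4, 0],
  [1, 0, 0, 5, 0],
  [3, 0, 0, 5, 0],
  [1, 0, 0, 0, 0],
  [5, 5, 5, 5, 0],
];

// Repeatedly multiply a positive start vector by the matrix and rescale so the
// largest entry is 1; the vector converges to the dominant eigenvector, and the
// rescaling factor converges to the dominant eigenvalue (~4.1 here).
function powerIteration(matrix: number[][], iterations = 100): number[] {
  let v = matrix.map(() => 1);
  for (let i = 0; i < iterations; i++) {
    const next = matrix.map(row =>
      row.reduce((sum, cell, j) => sum + cell * v[j], 0),
    );
    const max = Math.max(...next);
    v = next.map(x => x / max);
  }
  return v;
}

console.log(powerIteration(M).map(x => x.toFixed(3)));
// -> [ '0.292', '0.158', '0.300', '0.071', '1.000' ]
```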
It may seem like quite a leap of logic to suddenly apply an abstract concept from linear algebra! But the matrix has been carefully crafted for this to work: that is why the comparison scores are non-negative confidence measures, and why each row is normalized by comparison count. Intuitively, the eigenvector rewards items that are known to be better than other high-scoring items.
By the Perron-Frobenius theorem, the largest eigenvalue is simple and has a unique eigenvector (up to scale) with all-positive entries, provided every betterness value is non-negative and the matrix is "irreducible", meaning there is a path of nonzero comparisons from every item to every other item. (Otherwise there would be multiple "islands" of items with no comparisons between the islands.)
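Irreducibility amounts to the comparison graph being strongly connected; a minimal sketch of such a check (TypeScript, illustrative only):

```typescript
// Treating matrix[i][j] > 0 as a directed edge i -> j, the matrix is
// irreducible iff every item can reach every other item (strong connectivity).
function isIrreducible(matrix: number[][]): boolean {
  const n = matrix.length;
  // Depth-first reachability from item 0, following edges forward or backward.
  const reachesAllFromZero = (forward: boolean): boolean => {
    const seen = new Set<number>([0]);
    const stack = [0];
    while (stack.length > 0) {
      const i = stack.pop()!;
      for (let j = 0; j < n; j++) {
        const edge = forward ? matrix[i][j] : matrix[j][i];
        if (edge > 0 && !seen.has(j)) {
          seen.add(j);
          stack.push(j);
        }
      }
    }
    return seen.size === n;
  };
  // Strongly connected iff item 0 reaches everything and everything reaches item 0.
  return reachesAllFromZero(true) && reachesAllFromZero(false);
}
```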
This technique is not magically perfect, but it does have some good properties:
- it is largely independent of comparison count and strength of schedule
- it captures transitivity (if X > Y and Y > Z, then X > Z)
- it's "smooth"; small betterness changes typically yield small strength changes
- it can be related quantitatively to probability, if the subject can be modeled that way
- it can be readily computed with an iterative approach
Importantly, for subjective ratings like TERPECA, there is no perfect way to determine betterness values; the formula has to be empirically tuned for sensible outcomes. In fact, most of the complexity of TERPECA ranking lies in computing pairwise betterness from voter rankings...
TERPECA uses a multi-factor strategy to calculate the betterness of each X vs. Y comparison:
- tally direct "wins" and "losses" when X appears above or below Y in a list
- tally indirect ("secondary") wins and losses when X beats P and P beats Y directly
- use the Wilson confidence interval midpoint to convert the combined wins and losses into a confidence value
These steps and their wrinkles are described in more detail below.
For every X vs. Y comparison, "wins" (when X is above Y in a voter's ranking) and "losses" (when X is below Y) are counted in both weighted and unweighted totals across all voter rankings:
```mermaid
flowchart LR
subgraph For every voter ranking list
subgraph For each room X on that list
portion("p = 1 / versions ranked") --o roomWinUnw & roomLossUnw
portion --> weight("weight =<br>p / sqrt(list size)") --o roomWin & roomLoss
subgraph For each room Y below X
roomWinUnw("unweightedWins<br>[X vs. Y] += p")
roomWin("weightedWins<br>[X vs. Y] += weight")
end
subgraph For each room Y above X
roomLossUnw("unweightedLosses<br>[X vs. Y] += p")
roomLoss("weightedLosses<br>[X vs. Y] += weight")
end
end
end
```
In the special case where variants of a room were ranked separately but ultimately scored as one room, and the same voter ranked multiple versions in the same list, wins and losses are apportioned fractionally (p in the diagram above) for weighted and unweighted tallies. This is rare.
Inputs to weighted tallies are further divided by the square root of the voter's list size, to soften the effect of voters with long lists.
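In the common case where each room appears only once in a list (no merged variants, so p = 1), the tallying amounts to something like this sketch (TypeScript; names are illustrative, not from the actual codebase):

```typescript
type Tally = Map<string, number>;  // keyed by "X vs. Y"

function addTo(tally: Tally, key: string, amount: number): void {
  tally.set(key, (tally.get(key) ?? 0) + amount);
}

// Tally one voter's ranking (best room first) into the four running totals.
function tallyRanking(
  ranking: string[],
  unweightedWins: Tally, unweightedLosses: Tally,
  weightedWins: Tally, weightedLosses: Tally,
): void {
  const weight = 1 / Math.sqrt(ranking.length);  // soften long lists
  for (let i = 0; i < ranking.length; i++) {
    for (let j = i + 1; j < ranking.length; j++) {
      const winner = ranking[i];  // ranked above
      const loser = ranking[j];   // ranked below
      addTo(unweightedWins, `${winner} vs. ${loser}`, 1);
      addTo(unweightedLosses, `${loser} vs. ${winner}`, 1);
      addTo(weightedWins, `${winner} vs. ${loser}`, weight);
      addTo(weightedLosses, `${loser} vs. ${winner}`, weight);
    }
  }
}
```

Merged room variants would additionally scale each contribution by the fractional p described above.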
For example, if a voter submitted this ranking:
Rank | Room |
---|---|
1 | 😣 Pain Room (hard mode) |
2 | 🦄 Fluffy Unicorns |
3 | 😫 Pain Room (hard with pain) |
4 | 🔒 Escape The Eigenvector |
5 | 😁 Pain Room (easy mode) |
6 | 💩 Poop Ice Cream: The Escape Room Experience |
This voter would have a weight of 1/sqrt(6) ~= 0.408, and if Pain Room variants are merged for scoring, variant-specific wins and losses would be divided by 3, so the following wins and losses would be tallied:
Comparison | Wins | Weighted wins | Losses | Weighted losses | Notes |
---|---|---|---|---|---|
😁😣😫 vs. 🦄 | 0.333 | 0.136 | 0.667 | 0.272 | 1 variant wins, 2 lose |
😁😣😫 vs. 🔒 | 0.667 | 0.272 | 0.333 | 0.136 | 2 variants win, 1 loses |
😁😣😫 vs. 💩 | 1.0 | 0.408 | 0 | 0 | all variants win |
🦄 vs. 😁😣😫 | 0.667 | 0.272 | 0.333 | 0.136 | wins against 2 variants, loses against 1 |
🦄 vs. 🔒 | 1.0 | 0.408 | 0 | 0 | wins |
🦄 vs. 💩 | 1.0 | 0.408 | 0 | 0 | wins |
🔒 vs. 🦄 | 0 | 0 | 1.0 | 0.408 | loses |
🔒 vs. 😁😣😫 | 0.333 | 0.136 | 0.667 | 0.272 | wins against 1 variant, loses against 2 |
🔒 vs. 💩 | 1.0 | 0.408 | 0 | 0 | wins |
💩 vs. 😁😣😫 | 0 | 0 | 1.0 | 0.408 | loses |
💩 vs. 🦄 | 0 | 0 | 1.0 | 0.408 | loses |
💩 vs. 🔒 | 0 | 0 | 1.0 | 0.408 | loses |
(Note the symmetry that the scores for X vs. Y are the win<>loss mirror of Y vs. X.)
Weighted and unweighted wins and losses are tallied across all valid rankings for every pair of rooms, and these values are used in subsequent steps.
For the eigenvector algorithm to work well, preference matrix betterness scores should represent the confidence that one room should be ranked higher than another. For example, 70/100 voters ranking X above Y gives more confidence than 4/5 voters doing so.
For every X vs. Y comparison, TERPECA uses the Wilson confidence interval midpoint at the 95% confidence level (i.e. alpha = 0.05) to convert modified win/loss counts into confidence values:
```mermaid
flowchart LR
subgraph For every room X
subgraph "For every room Y (not X)"
reinflation("reinflation for X vs. Y =<br>(unweighted wins + losses)<br>/ (weighted wins + losses)")
reinflation --> reinflatedWins & reinflatedLosses
weightedWins("weightedWins") --> reinflatedWins
weightedLosses("weightedLosses") --> reinflatedLosses
reinflatedWins((×)) --> directWilson
reinflatedLosses((×)) --> directWilson
directWilson((Wilson<br>conf.)) --> directConfidence("directConfidence<br>[X vs. Y]")
end
end
```
The weighted wins and losses from earlier are "reinflated" by the inverse of the average weighting before the Wilson confidence midpoint is computed. Together, the weighting and "reinflation" mean that while the influence of "supervoters" on the ranking of a room is reduced, the actual number of comparisons made is still taken into account when determining confidence.
For example, if rooms had these wins and losses (symmetric cases elided):
Comparison | Wins | Weighted wins | Losses | Weighted losses |
---|---|---|---|---|
Room A vs. Room B | 70 | 30 | 30 | 20 |
Room C vs. Room D | 4 | 3 | 8 | 6 |
Room E vs. Room F | 1 | 1 | 0 | 0 |
The computed values would be as follows:
Comparison | Reinflation factor | Reinflated wins | Reinflated losses | Wilson score |
---|---|---|---|---|
Room A vs. Room B | (70+30)/(30+20)=2 | 60 | 40 | 0.596 |
Room C vs. Room D | (4+8)/(3+6)≈1.33 | 4 | 8 | 0.374 |
Room E vs. Room F | (1+0)/(1+0)=1 | 1 | 0 | 0.603 |
Note again that TERPECA uses the Wilson confidence interval midpoint, not the more commonly used lower bound! At the 95% level, the midpoint is roughly equivalent to always adding two wins and two losses, then taking the simple win fraction.
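For concreteness, a minimal sketch of the reinflation and midpoint computation (TypeScript; illustrative only, not the production code):

```typescript
// Midpoint of the Wilson score interval for `wins` successes out of
// `wins + losses` trials; z = 1.96 corresponds to a 95% interval.
function wilsonMidpoint(wins: number, losses: number, z = 1.96): number {
  const n = wins + losses;
  if (n === 0) return 0;  // no comparisons at all: no confidence either way
  const pHat = wins / n;
  const z2 = z * z;
  return (pHat + z2 / (2 * n)) / (1 + z2 / n);
}

// Reinflate the weighted tallies back up to the unweighted comparison count,
// then take the Wilson midpoint, as in the table above.
function directConfidence(
  weightedWins: number, weightedLosses: number,
  unweightedWins: number, unweightedLosses: number,
): number {
  const reinflation =
    (unweightedWins + unweightedLosses) / (weightedWins + weightedLosses);
  return wilsonMidpoint(weightedWins * reinflation, weightedLosses * reinflation);
}

console.log(directConfidence(30, 20, 70, 30).toFixed(3));  // "0.596"
console.log(wilsonMidpoint(1, 0).toFixed(3));              // "0.603"
```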
This "directConfidence" value would be suitable for a preference matrix, but TERPECA has more factors to consider...
The eigenvector method does infer transitive relations (if A beats B, and B beats C, then A beats C), but TERPECA adds explicit weighting for transitive relations as well, which empirically improves ranking quality given the sparse nature of direct comparisons. This may be the most subtle aspect of the ranking algorithm.
```mermaid
flowchart LR
subgraph For every room X
subgraph "For every room Y (not X)"
subgraph "For every room P (not X, Y)"
pivotFilter("valid pivot?") --> pivotWinConfidence & pivotLossConfidence
pivotWinConfidence("confidence<br>for X > P > Y")
pivotLossConfidence("confidence<br>for X < P < Y")
end
pivotWinConfidence --o indirectWinConfidence("indirectWinConfidence<br>for X > (all P) > Y")
pivotLossConfidence --o indirectLossConfidence("indirectLossConfidence<br>for X < (all P) < Y")
indirectWinConfidence & indirectLossConfidence --> reverseConfidence
reverseConfidence((inv.<br>Wilson)) --> indirectWins & indirectLosses
indirectWins("indirectWins<br>[X vs. Y]")
indirectLosses("indirectLosses<br>[X vs. Y]")
end
end
```
For every possible X vs. Y comparison (even when no direct comparisons exist), all other rooms are examined for suitability as tertiary "pivots". A room P is a valid pivot if:
- P is neither X nor Y
- direct comparisons are available between X and P and also between Y and P
- X beats P (directConfidence[X vs. P] > 0.5) and P doesn't lose to Y (directConfidence[Y vs. P] < 0.5)
- or the other way around (Y beats P and P doesn't lose to X)
For valid pivots, a "win confidence" (confidence in X > P > Y) and a "loss confidence" (confidence in X < P < Y) are computed as if by this pseudocode, where winConfidence[A vs. B] refers to the directConfidence value computed earlier:
```
pivotStrength =
  min(weightedWins[X vs. P], weightedLosses[Y vs. P]) +
  min(weightedLosses[X vs. P], weightedWins[Y vs. P])
scale =
  min(pivotStrength, sqrt(pivotStrength)) /  // to reduce the impact of any one pivot
  (winConfidence[X vs. P] + winConfidence[Y vs. P])
indirectWinConfidence[X vs. Y] += scale * winConfidence[X vs. P]
indirectLossConfidence[X vs. Y] += scale * winConfidence[Y vs. P]
```
Finally, these confidence values (derived from winConfidence values, which are Wilson confidence midpoints) are put through an inverse of the Wilson confidence midpoint computation to produce synthetic "indirectWins[X vs. Y]" and "indirectLosses[X vs. Y]" values (the "secondary" wins and losses mentioned above).
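Algebraically, the inverse step just solves the midpoint formula for the win count; a sketch (TypeScript; how the actual code chooses the synthetic totals is not spelled out here):

```typescript
// Given a target Wilson midpoint `confidence` and a total comparison count
// `total`, solve midpoint = (wins + z^2 / 2) / (total + z^2) for wins.
function inverseWilsonMidpoint(confidence: number, total: number, z = 1.96): number {
  const z2 = z * z;
  return confidence * (total + z2) - z2 / 2;
}
```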
If this all seems a little arbitrary and magical, it is largely driven by empirical observations over the years!
As a last step, for every X vs. Y comparison, wins and losses from direct comparisons (weighted) and secondary comparisons (synthetic, as above) are added into total win/loss scores for another Wilson confidence evaluation. (Note that this does not use the directConfidence value for the pair, which only feeds into the indirect comparisons.)
```mermaid
flowchart LR
subgraph For every room X
subgraph "For every room Y (not X)"
directWins("weightedWins") --> allWinsAdd
indirectWins("indirectWins") --> allWinsAdd
allWinsAdd((#plus;)) --> finalWilson
directLosses("weightedLosses") --> allLossesAdd
indirectLosses("indirectLosses") --> allLossesAdd
allLossesAdd((#plus;)) --> finalWilson
finalWilson((Wilson<br>conf.)) --> finalConfidence("matrix[X vs. Y]")
end
end
```
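A sketch of this last combination step (TypeScript; illustrative only), which yields the entry for the preference matrix described at the start of this document:

```typescript
// Combine direct (weighted) and secondary (synthetic) tallies for one pair of
// rooms, then take the Wilson confidence midpoint to get the matrix entry.
function matrixEntry(
  weightedWins: number, weightedLosses: number,
  indirectWins: number, indirectLosses: number,
  z = 1.96,
): number {
  const wins = weightedWins + indirectWins;
  const losses = weightedLosses + indirectLosses;
  const n = wins + losses;
  if (n === 0) return 0;
  const z2 = z * z;
  return (wins / n + z2 / (2 * n)) / (1 + z2 / n);
}
```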
Once the pairwise betterness values are computed as above, the power iteration method is used to find the eigenvector with the largest eigenvalue. As discussed above, this eigenvector is directly usable for sorting the final ranked list.
A variety of reports and diagnostics are available from this process, which are outside the scope of this document.
Literature on ranking systems
- Keener, J. The Perron-Frobenius Theorem and the Ranking of Football Teams
- Newman, M.E.J. Efficient Computation of Rankings from Pairwise Comparisons
- Langville, A. & Meyer, C. Who's #1? The Science of Rating and Ranking
- Miller, E. How Not To Sort By Average Rating
- Vang, J. Data Science Topics, the "Rating and Ranking" section
- Wikipedia, Binomial proportion confidence interval, the "Wilson confidence interval" section
- Wikipedia, Bradley-Terry model
- Wikipedia, Sports Rating System
- Wikipedia, Pairwise Comparison (Psychology)
Code in voter-portal/.../analyze.component.ts (only visible to project members).