(Or, rather, I had an LLM write it for me. (But another LLM checked it and said it was correct, so...))
When a battle is voted a tie, the "ideal model" is treated as having tied with both models. When a battle is voted "both bad", the ideal model is treated as having beaten both. So its rating acts as an upper bound on Elo scores, and since the judgments come from humans, a model that consistently scored that well would be human-equivalent?
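Roughly, the rule above could be sketched like this. This is only an illustration, not the code from the gist: the `IDEAL` name, the `add_ideal_battles` helper, the dict layout, and the `"tie"` / `"tie (bothbad)"` vote labels are all assumptions. Votes with a clear winner are passed through with no extra rows, since the rule above only covers ties and "both bad".

```python
IDEAL = "ideal"  # hypothetical name for the virtual "ideal model"


def add_ideal_battles(battles):
    """Return the battles plus extra ideal-model rows for tie and both-bad votes.

    Each battle is assumed to be a dict with keys "model_a", "model_b" and
    "winner", where "winner" is one of "model_a", "model_b", "tie" or
    "tie (bothbad)". These labels are an assumption about the data format.
    """
    out = []
    for battle in battles:
        out.append(battle)
        contestants = (battle["model_a"], battle["model_b"])
        if battle["winner"] == "tie":
            # Tie: the ideal model is treated as having tied with both models.
            for model in contestants:
                out.append({"model_a": IDEAL, "model_b": model, "winner": "tie"})
        elif battle["winner"] == "tie (bothbad)":
            # Both bad: the ideal model is treated as having beaten both models.
            for model in contestants:
                out.append({"model_a": IDEAL, "model_b": model, "winner": "model_a"})
        # Decisive votes get no ideal-model rows in this sketch (assumption).
    return out
```

The expanded list could then be fed into whatever Elo fit is already in use, and the rating that comes out for `IDEAL` is the upper bound.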
https://gist.github.com/endolith/e001d8b7811699cf9be822a774e7cb67
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/discussions/67