@maxim-saplin
Last active July 27, 2025 19:27
LLM Chess Elo All Models

Engine Elo ratings:

Player           Games     Elo  ±95%CI
random            4000  -122.3    23.0
stockfish-lvl-1   1000   824.9    64.2

Stage 2: LLM Elo Ratings

Player                                               Games     Elo  ±95%CI
3x-o4-mini-2025-04-16-low_41mini-t03                    33   114.6   133.1
3x-o4-mini-2025-04-16-low_o4-mini-2025-04-16-medium     33   210.2   121.3
5x-o4-mini-2025-04-16-low_o4-mini-2025-04-16-medium     33   127.7   130.9
7x-o4-mini-2025-04-16-low_o4-mini-2025-04-16-medium     15   137.8   191.8
claude-3-7-sonnet-20250219-thinking-budget-10000        33     5.4   159.0
claude-3-7-sonnet-20250219-thinking-budget-5000         33    72.4   141.4
gemini-25pro-t03_mini41-t00_mini41-t03                  30   226.5   126.1
grok-3-mini-beta-high                                  200   451.7    54.6
o3-2025-04-16-low                                      200   753.0    64.0
o3-2025-04-16-medium_timeout-60m                        11     nan     nan
o3-mini-2025-01-31-high                                 63   396.1    88.1
o3-mini-2025-01-31-low                                  33   -85.3   192.5
o3-mini-2025-01-31-medium                               38   210.7   113.0
o4-mini-2025-04-16-high                                 98   346.4    69.9
o4-mini-2025-04-16-low                                  33   140.3   129.0
o4-mini-2025-04-16-medium                               40   311.1   108.0

Stage 3: Unified Elo Ratings

Player                                               Games     Elo  ±95%CI
*stockfish-lvl-1                                      1000   824.9    64.2
o3-2025-04-16-low                                      200   753.0    64.0
dragon-lvl-5                                             0   750.0     0.0
dragon-lvl-4                                             0   625.0     0.0
dragon-lvl-3                                             0   500.0     0.0
grok-3-mini-beta-high                                  200   451.7    54.6
o3-mini-2025-01-31-high                                 63   396.1    88.1
dragon-lvl-2                                             0   375.0     0.0
o4-mini-2025-04-16-high                                 98   346.4    69.9
o4-mini-2025-04-16-medium                               40   311.1   108.0
dragon-lvl-1                                             0   250.0     0.0
gemini-25pro-t03_mini41-t00_mini41-t03                  30   226.5   126.1
o3-mini-2025-01-31-medium                               38   210.7   113.0
3x-o4-mini-2025-04-16-low_o4-mini-2025-04-16-medium     33   210.2   121.3
o4-mini-2025-04-16-low                                  33   140.3   129.0
5x-o4-mini-2025-04-16-low_o4-mini-2025-04-16-medium     33   127.7   130.9
3x-o4-mini-2025-04-16-low_41mini-t03                    33   114.6   133.1
claude-3-7-sonnet-20250219-thinking-budget-5000         33    72.4   141.4
claude-3-7-sonnet-20250219-thinking-budget-10000        33     5.4   159.0
o3-mini-2025-01-31-low                                  33   -85.3   192.5
*random                                               4000  -122.3    23.0

Stage 4: Extended Elo Ratings vs Random

Player                                               Games     Elo  ±95%CI
*stockfish-lvl-1                                      1000   824.9    64.2
o3-2025-04-16-low                                      200   753.0    64.0
dragon-lvl-5                                             0   750.0     0.0
dragon-lvl-4                                             0   625.0     0.0
dragon-lvl-3                                             0   500.0     0.0
grok-3-mini-beta-high                                  200   451.7    54.6
o3-mini-2025-01-31-high                                 63   396.1    88.1
dragon-lvl-2                                             0   375.0     0.0
o4-mini-2025-04-16-high                                 98   346.4    69.9
o4-mini-2025-04-16-medium                               40   311.1   108.0
o3-2025-04-16-medium                                    53   305.6   160.0
o1-2024-12-17-medium                                    41   276.4   170.1
dragon-lvl-1                                             0   250.0     0.0
gemini-25pro-t03_mini41-t00_mini41-t03                  30   226.5   126.1
o3-mini-2025-01-31-medium                               38   210.7   113.0
3x-o4-mini-2025-04-16-low_o4-mini-2025-04-16-medium     33   210.2   121.3
non-gemini-25pro-t03_mini41-t00_mini41-t03              45   141.7   124.4
o4-mini-2025-04-16-low                                  33   140.3   129.0
o1-2024-12-17-low                                       47   140.0   121.4
5x-o4-mini-2025-04-16-low_o4-mini-2025-04-16-medium     33   127.7   130.9
3x-o4-mini-2025-04-16-low_41mini-t03                    33   114.6   133.1
claude-3-7-sonnet-20250219-thinking-budget-5000         33    72.4   141.4
claude-v4-sonnet-thinking_16000                         38    47.0   118.8
o1-preview-2024-09-12                                   30    46.3   133.6
claude-v4-opus-thinking_16000                           34    29.3   123.4
claude-3-7-sonnet-20250219-thinking-budget-10000        33     5.4   159.0
non-r1-t03_mini41-t10_mini41-t03                        35     4.1   119.1
claude-v3-7-sonnet-thinking_10000                       37    -1.0   115.4
claude-v4-opus                                          37   -10.9   114.7
claude-v4-sonnet                                        37   -10.9   114.7
grok-3-mini-beta-low                                    59   -33.9    89.7
claude-v3-7-sonnet-thinking_1024                        41   -36.1   107.5
claude-v3-7-sonnet-thinking_5000                        41   -36.1   107.5
grok-3-mini-beta                                        42   -37.3   106.2
o1-mini-2024-09-12                                      30   -52.4   125.0
claude-v3-7-sonnet-thinking_2048                        84   -62.4    74.5
gemini-2.5-pro-preview-05-06                            42   -70.7   105.2
o3-mini-2025-01-31-low                                  33   -85.3   192.5
gpt-4-32k-0613                                          33   -97.8   118.6
non_gpt-4.1-mini-2025-04-14_t00_t07_t03                 30   -98.9   124.4
qwen-max-2025-01-25                                     60   -98.9    88.0
gpt-4o-2024-11-20                                       71  -102.0    80.9
claude-v3-5-sonnet-v2                                   60  -104.7    88.0
claude-v3-5-sonnet-v1                                   60  -110.5    88.1
gpt-4-turbo-2024-04-09                                  30  -110.5   124.6
gpt-4.5-preview-2025-02-27                              44  -111.0   102.9
gpt-4-0613                                              33  -118.9   119.0
*random                                               4000  -122.3    23.0
gpt-4.1-2025-04-14                                      80  -130.9    76.7
gpt-4o-2024-08-06                                       60  -133.9    88.7
grok-3-beta                                             42  -137.2   106.2
claude-v3-5-haiku                                       42  -137.2   106.2
gemini-2.5-pro-preview-03-25                            43  -144.3   105.3
claude-v3-opus                                          30  -145.7   126.1
gemini-2.5-pro                                          41  -147.2   107.9
claude-v3-7-sonnet                                      42  -154.3   107.0
gpt-4o-2024-05-13                                       60  -157.7    89.7
deepseek-reasoner-r1                                    31  -216.2   130.8
gpt-4.1-mini-2025-04-14                                 84  -231.5    80.8
gpt-4o-mini-2024-07-18                                  30  -234.5   135.7
llama-3-70b-instruct-awq                                30  -278.1   143.6
gemini-2.5-flash                                        42  -289.3   123.4
gemini-2.0-flash-001                                    67  -310.8   101.0
grok-2-1212                                             49  -334.8   123.0
gemini-1.5-flash-001                                    30  -366.9   166.8
non-haiku35-t07_haiku35-t10_haiku35-t03                 42  -366.9   141.0
gemma-2-27b-it@q6_k_l                                   30  -412.4   182.9
llama-4-scout-cerebras                                  39  -464.1   179.7
gemma-2-9b-it-8bit                                      30  -545.7   249.2
deepseek-chat-v3-0324                                   45  -579.5   221.6
llama-3.3-70b                                           42  -607.7   246.7
qwen-plus-2025-01-25                                    33  -616.2   284.5
gemini-1.5-pro-preview-0409                             40  -651.0   283.4
gemini-2.0-flash-exp                                    30  -672.2   346.3
qwen2.5-72b-instruct                                    30  -672.2   346.3
llama3.1-8b                                             90  -795.6   280.4
gemini-2.0-flash-lite-001                               66  -812.4   343.1
deepseek-chat-v3                                        70  -822.8   342.9
amazon.nova-lite-v1                                     42     nan     nan
amazon.nova-pro-v1                                      33     nan     nan
chat-bison-32k@002                                      36     nan     nan
claude-v3-haiku                                         40     nan     nan
deephermes-3-llama-3-8b-preview@q8                      42     nan     nan
deepseek-r1-distill-qwen-14b@q8_0                       30     nan     nan
deepseek-r1-distill-qwen-32b@q4_k_m                     30     nan     nan
gemini-2.0-flash-lite-preview-02-05                     39     nan     nan
gemini-2.0-flash-thinking-exp-01-21                     33     nan     nan
gemini-2.0-flash-thinking-exp-1219                      30     nan     nan
gemma-3-12b-it@iq4_xs                                  134     nan     nan
gemma-3-12b-it@q8_0                                     67     nan     nan
gemma-3-27b-it@iq4_xs                                   67     nan     nan
gemma2-9b-it-groq                                       35     nan     nan
gpt-35-turbo-0125                                       30     nan     nan
gpt-35-turbo-0301                                       30     nan     nan
gpt-35-turbo-0613                                       30     nan     nan
gpt-35-turbo-1106                                       30     nan     nan
gpt-4.1-nano-2025-04-14                                 42     nan     nan
granite-3.1-8b-instruct                                 30     nan     nan
internlm3-8b-instruct                                   30     nan     nan
llama-2-7b-chat                                         30     nan     nan
llama-3.1-tulu-3-8b@q8_0                                42     nan     nan
llama3-8b-8192                                          60     nan     nan
magistral-small-2506                                    42     nan     nan
mercury-coder-small                                     60     nan     nan
ministral-8b-instruct-2410                              30     nan     nan
mistral-nemo-12b-instruct-2407                          30     nan     nan
mistral-small-24b-instruct-2501@q4_k_m                  42     nan     nan
mistral-small-instruct-2409                             30     nan     nan
phi-4                                                   30     nan     nan
qwen-turbo-2024-11-01                                   33     nan     nan
qwen2.5-14b-instruct-1m                                 42     nan     nan
qwen2.5-14b-instruct@q8_0                               30     nan     nan
qwen2.5-7b-instruct-1m                                  42     nan     nan
qwen3-14b@iq4_xs-thinking                               42     nan     nan
qwen3-32b-cerebras-thinking                             47     nan     nan
qwq-32b                                                 33     nan     nan
qwq-32b-preview@q4_k_m                                  30     nan     nan
sky-t1-32b-preview                                      30     nan     nan

Note: Elo=NaN indicates no valid root was found (e.g., all wins or all losses), so Elo is undefined.
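
To illustrate why an all-win or all-loss record gives NaN, here is a minimal sketch that assumes the standard logistic Elo model and a single opponent of known rating. The function names are hypothetical, and this is not the script that produced the tables; per the note above, that script searches for a root, whereas this sketch uses the closed form available in the one-opponent case.

```python
import math


def expected_score(rating: float, opponent_rating: float) -> float:
    """Expected score (0..1) under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def estimate_elo(score: float, games: int, opponent_rating: float) -> float:
    """Rating whose expected score matches the observed score fraction.

    score is wins + 0.5 * draws. Returns NaN when no finite rating fits,
    i.e. when there are no games or the record is all wins or all losses.
    """
    if games <= 0:
        return math.nan
    fraction = score / games
    if fraction <= 0.0 or fraction >= 1.0:
        # The matching rating would be -inf or +inf, so Elo is undefined.
        return math.nan
    return opponent_rating - 400.0 * math.log10(1.0 / fraction - 1.0)


if __name__ == "__main__":
    # Hypothetical records against an opponent rated -122.3 (the random player).
    print(estimate_elo(15.0, 30, -122.3))  # even score: same rating as opponent
    print(estimate_elo(0.0, 30, -122.3))   # all losses: nan
```

An even score maps back to the opponent's own rating, while a score fraction of exactly 0 or 1 has no finite solution, which matches the nan rows in the Stage 4 table.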

Dragon levels (theoretical Elo):
lvl-1 : 250
lvl-2 : 375
lvl-3 : 500
lvl-4 : 625
lvl-5 : 750
