
Google Gemini 3 Preview Models - EQ-Bench 3 Emotional Intelligence Benchmark

Research by LLM Index (https://llmindex.net)
Date: December 25, 2025
Benchmark: EQ-Bench 3 (Emotional Intelligence)
Judge Model: Claude 3.7 Sonnet


Executive Summary

This research evaluates Google's newly released Gemini 3 preview models on the EQ-Bench 3 emotional intelligence benchmark. Both models demonstrate strong emotional intelligence capabilities, ranking among the top performers globally.

| Model | ELO Rating | ELO Normalized | Rubric Score | Global Rank |
|---|---|---|---|---|
| Gemini 3 Pro Preview | 1642.9 | 1458.6 | 17.27/20 | #5 |
| Gemini 3 Flash Preview | 1622.3 | 1410.6 | 17.33/20 | #5 |

Key Findings

  1. Pro outperforms Flash in ELO - Despite Flash having a marginally higher rubric score (17.33 vs 17.27), Pro achieves a significantly higher ELO rating (+20.7 points), indicating stronger performance in head-to-head comparisons against reference models.

  2. Both models rank #5 globally - They place just below Gemini 2.5 Pro and above GPT-4o, ChatGPT-4o-latest, and Claude Opus 4.

  3. Competitive with flagship models - Both Gemini 3 previews outperform GPT-5 Chat, Claude Sonnet 4, and DeepSeek R1 in emotional intelligence.
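To put the ELO gap in perspective, the sketch below converts a rating difference into an expected head-to-head win rate using the classic Elo logistic curve. EQ-Bench 3 derives its ratings with TrueSkill, so the 400-point logistic scale here is an illustrative assumption rather than the benchmark's actual formula.

```python
# Rough intuition for an ELO gap: expected head-to-head win rate under the
# classic Elo logistic curve. EQ-Bench 3 uses TrueSkill, so the 400-point
# scale below is an assumption for illustration, not the benchmark's math.

def expected_win_rate(delta: float, scale: float = 400.0) -> float:
    """Probability that the higher-rated model wins a single comparison."""
    return 1.0 / (1.0 + 10.0 ** (-delta / scale))

gap = 1642.9 - 1622.3  # Gemini 3 Pro vs. Flash, about 20.7 points
print(f"Expected win rate for Pro over Flash: {expected_win_rate(gap):.1%}")
# ~53.0%: a small per-match edge that becomes detectable over 578 comparisons
```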


Detailed Results

Gemini 3 Pro Preview

Model ID: google/gemini-3-pro-preview
ELO Rating: 1642.94
ELO Normalized: 1458.57
Confidence Interval: [1631.55, 1654.33]
Sigma: 5.81
Rubric Score: 17.27/20 (86.35%)

Gemini 3 Flash Preview

Model ID: google/gemini-3-flash-preview  
ELO Rating: 1622.25
ELO Normalized: 1410.62
Confidence Interval: [1612.64, 1631.85]
Sigma: 4.90
Rubric Score: 17.33/20 (86.65%)
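Both confidence intervals are consistent with a standard Gaussian 95% interval of the rating plus or minus 1.96 sigma. That convention is inferred from the numbers rather than stated by the benchmark, but it checks out to within rounding:

```python
# Sanity check: reported CIs match rating +/- 1.96 * sigma (a Gaussian 95%
# interval). This convention is inferred from the data, not documented.
models = {
    "gemini-3-pro-preview":   (1642.94, 5.81),  # reported [1631.55, 1654.33]
    "gemini-3-flash-preview": (1622.25, 4.90),  # reported [1612.64, 1631.85]
}
for name, (rating, sigma) in models.items():
    print(f"{name}: [{rating - 1.96 * sigma:.2f}, {rating + 1.96 * sigma:.2f}]")
```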

Global Leaderboard Context

Based on EQ-Bench 3 pairwise comparisons against 47 reference models:

| Rank | Model | ELO |
|---|---|---|
| 1 | Moonshot Kimi K2 Instruct | 1709 |
| 2 | OpenRouter Horizon Alpha | 1702 |
| 3 | OpenAI o3 | 1666-1672 |
| 4 | Gemini 2.5 Pro Preview | 1643 |
| 5 | Gemini 3 Pro Preview | 1643 |
| 5 | Gemini 3 Flash Preview | 1622 |
| 6 | ChatGPT-4o-latest | 1591-1597 |
| 7 | GPT-5 Chat Latest | 1584-1591 |
| 8 | ChatGPT-4o-latest | 1564-1567 |
| 9 | GLM-4.5 | 1561 |
| 10 | o4-mini | 1550 |
| 11 | Claude Opus 4 | 1549 |
| 12 | Gemini 2.5 Pro (03-25) | 1546 |
| 13 | Qwen3 235B | 1541 |
| 14 | DeepSeek R1 | 1538 |
| 15 | Claude Sonnet 4 | 1533 |

Methodology

About EQ-Bench 3

EQ-Bench 3 evaluates LLMs through 45 complex multi-turn role-play scenarios testing:

  • Demonstrated Empathy - Understanding others' emotional states
  • Pragmatic EI - Practical application of emotional intelligence
  • Depth of Insight - Nuanced analysis of emotional situations
  • Social Dexterity - Navigation of complex social dynamics
  • Emotional Reasoning - Logical processing of emotional information
  • Message Tailoring - Adapting communication style appropriately
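As a hedged illustration of how a rubric pass over these dimensions could be structured, the sketch below builds a judge prompt from the six categories. The prompt wording and the simple averaging are invented for illustration; they are not EQ-Bench 3's actual judge prompt or aggregation.

```python
# Illustrative only: one possible shape for a rubric-style judge request.
# The six dimensions come from the benchmark description above; the prompt
# text and the averaging are assumptions, not EQ-Bench 3 internals.
DIMENSIONS = [
    "Demonstrated Empathy",
    "Pragmatic EI",
    "Depth of Insight",
    "Social Dexterity",
    "Emotional Reasoning",
    "Message Tailoring",
]

def build_judge_prompt(transcript: str) -> str:
    """Ask a judge model to score one role-play transcript per dimension."""
    criteria = "\n".join(f"- {d}: 0-20" for d in DIMENSIONS)
    return (
        "Score the assistant's replies in the transcript below on each "
        f"criterion (0 = poor, 20 = excellent):\n{criteria}\n\n"
        f"Transcript:\n{transcript}"
    )

def aggregate(scores: dict[str, float]) -> float:
    """Collapse per-dimension scores into a single 0-20 rubric score."""
    return sum(scores.values()) / len(scores)
```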

Scoring Methods

  1. Rubric Score (0-20): Claude 3.7 Sonnet evaluates responses across 18 criteria
  2. ELO Rating: Pairwise comparisons against 47 reference models using the TrueSkill algorithm (see the sketch below)
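A minimal sketch of how TrueSkill ratings emerge from pairwise judge verdicts, using the open-source `trueskill` Python package. The win/loss outcomes below are invented for illustration and are not EQ-Bench data.

```python
# TrueSkill rating updates from pairwise judge verdicts (pip install trueskill).
# The win/loss sequence here is invented, not actual EQ-Bench results.
import trueskill

test_model = trueskill.Rating()  # default prior: mu=25.0, sigma=25/3
reference = trueskill.Rating()   # stands in for one of the 47 reference models

# Each comparison: the judge model decides which response was better.
for test_won in [True, True, False, True]:
    if test_won:
        test_model, reference = trueskill.rate_1vs1(test_model, reference)
    else:
        reference, test_model = trueskill.rate_1vs1(reference, test_model)

print(f"test model: mu={test_model.mu:.2f}, sigma={test_model.sigma:.2f}")
```

The benchmark's published ratings sit on a chess-style scale (around 1500-1700) rather than TrueSkill's default mu=25 prior, so the raw values are presumably rescaled before reporting; the exact mapping is not given here.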

Benchmark Configuration

Test Models: google/gemini-3-pro-preview, google/gemini-3-flash-preview
Judge Model: anthropic/claude-3.7-sonnet
Provider: OpenRouter
Pairwise Comparisons: 578 per model
Reference Models: 47
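For reproduction, both preview models and the judge are reachable through OpenRouter's OpenAI-compatible chat completions endpoint. A minimal sketch, assuming OPENROUTER_API_KEY is set in the environment; the role-play prompt is a placeholder, not an actual EQ-Bench 3 scenario.

```python
# Query one test model through OpenRouter's OpenAI-compatible endpoint.
# The prompt is a placeholder, not a real EQ-Bench 3 scenario.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "google/gemini-3-pro-preview",
        "messages": [{
            "role": "user",
            "content": "Your close friend just lost their job and texts you "
                       "at midnight. What do you say?",
        }],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```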

Conclusions

Google's Gemini 3 preview models demonstrate excellent emotional intelligence capabilities:

  1. Pro is the better choice for EQ-sensitive tasks - Higher ELO indicates stronger real-world emotional reasoning
  2. Flash offers competitive EQ at lower cost - Only ~20 ELO points behind Pro with potentially faster inference
  3. Both outperform most flagship models - Including GPT-5 variants and Claude Sonnet 4
  4. Gemini 2.5 Pro remains slightly ahead - The 2.5 Pro preview still edges out 3 Pro by a fraction of a point (both round to 1643 in the leaderboard above)

Research conducted using the EQ-Bench 3 benchmarking toolkit
Data collected December 25, 2025
