Google Gemini 3 Preview Models - EQ-Bench 3 Emotional Intelligence Benchmark

Research by LLM Index
Date: December 25, 2025
Benchmark: EQ-Bench 3 (Emotional Intelligence)
Judge Model: Claude 3.7 Sonnet

Executive Summary

This research evaluates Google's newly released Gemini 3 preview models on the EQ-Bench 3 emotional intelligence benchmark. Both models demonstrate strong emotional intelligence capabilities, ranking among the top performers globally.

Model	ELO Rating	ELO Normalized	Rubric Score	Global Rank
Gemini 3 Pro Preview	1642.9	1458.6	17.27/20	#5
Gemini 3 Flash Preview	1622.3	1410.6	17.33/20	#5

Key Findings

Pro outperforms Flash in ELO - Although Flash has a marginally higher rubric score (17.33 vs 17.27), Pro achieves a higher ELO rating (+20.7 points), indicating stronger performance in head-to-head comparisons against reference models. The two confidence intervals ([1631.55, 1654.33] for Pro, [1612.64, 1631.85] for Flash) overlap by only about 0.3 points.
Both models rank #5 globally - Placing just below Gemini 2.5 Pro and above GPT-4o, ChatGPT-4o-latest, and Claude Opus 4.
Competitive with flagship models - Both Gemini 3 previews outperform GPT-5 Chat, Claude Sonnet 4, and DeepSeek R1 in emotional intelligence.
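
The ELO gap between the two models, and whether their confidence intervals overlap, can be checked directly from the figures reported under Detailed Results. The following is a minimal sketch; the helper name `intervals_overlap` is illustrative and is not part of the EQ-Bench 3 toolkit.

```python
# ELO ratings and confidence intervals as reported in this document.
pro_elo, pro_ci = 1642.94, (1631.55, 1654.33)
flash_elo, flash_ci = 1622.25, (1612.64, 1631.85)


def intervals_overlap(a, b):
    """Return True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Difference in raw ELO, rounded to the precision of the reported figures.
gap = round(pro_elo - flash_elo, 2)

# Width of the shared region between the two intervals (negative if disjoint).
overlap = intervals_overlap(pro_ci, flash_ci)
overlap_width = round(min(pro_ci[1], flash_ci[1]) - max(pro_ci[0], flash_ci[0]), 2)

print(f"ELO gap (Pro - Flash): {gap}")
print(f"Intervals overlap: {overlap} (by {overlap_width} points)")
```

The intervals touch only at their edges, so the gap is close to, but not cleanly past, the point where the two ratings would be separated by their reported intervals.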

Detailed Results

Gemini 3 Pro Preview

Model ID: google/gemini-3-pro-preview
ELO Rating: 1642.94
ELO Normalized: 1458.57
Confidence Interval: [1631.55, 1654.33]
Sigma: 5.81
Rubric Score: 17.27/20 (86.35%)

Gemini 3 Flash Preview

Model ID: google/gemini-3-flash-preview  
ELO Rating: 1622.25
ELO Normalized: 1410.62
Confidence Interval: [1612.64, 1631.85]
Sigma: 4.90
Rubric Score: 17.33/20 (86.65%)
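
The reported confidence intervals appear consistent with a symmetric normal interval of the form ELO ± 1.96 × sigma, and the rubric percentages are simply the score divided by 20. The sketch below checks both. The 1.96 multiplier (a 95% normal interval) is an assumption inferred from the numbers; the source does not state how the intervals were computed, and the function names are illustrative.

```python
Z_95 = 1.96  # assumed multiplier for a 95% normal interval; not stated in the source

# Figures as reported in the Detailed Results section.
models = {
    "Gemini 3 Pro Preview": {
        "elo": 1642.94, "sigma": 5.81, "ci": (1631.55, 1654.33), "rubric": 17.27,
    },
    "Gemini 3 Flash Preview": {
        "elo": 1622.25, "sigma": 4.90, "ci": (1612.64, 1631.85), "rubric": 17.33,
    },
}


def normal_interval(mean, sigma, z=Z_95):
    """Symmetric interval mean +/- z * sigma."""
    return (mean - z * sigma, mean + z * sigma)


def rubric_percent(score, max_score=20):
    """Rubric score expressed as a percentage of the maximum."""
    return round(100 * score / max_score, 2)


for name, m in models.items():
    lo, hi = normal_interval(m["elo"], m["sigma"])
    print(f"{name}: computed [{lo:.2f}, {hi:.2f}] vs reported {m['ci']}, "
          f"rubric {rubric_percent(m['rubric'])}%")
```

Under this assumption the computed bounds agree with the reported ones to within rounding (about 0.01 ELO points).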

Global Leaderboard Context

Based on EQ-Bench 3 pairwise comparisons against 47 reference models:

Rank	Model	ELO
1	Moonshot Kimi K2 Instruct	1709
2	OpenRouter Horizon Alpha	1702
3	OpenAI o3	1666-1672
4	Gemini 2.5 Pro Preview	1643
5	Gemini 3 Pro Preview	1643
5	Gemini 3 Flash Preview	1622
6	ChatGPT-4o-latest	1591-1597
7	GPT-5 Chat Latest	1584-1591
8	ChatGPT-4o-latest	1564-1567
9	GLM-4.5	1561
10	o4-mini	1550
11	Claude Opus 4	1549
12	Gemini 2.5 Pro (03-25)	1546
13	Qwen3 235B	1541
14	DeepSeek R1	1538
15	Claude Sonnet 4	1533

Methodology

About EQ-Bench 3

EQ-Bench 3 evaluates LLMs through 45 complex multi-turn role-play scenarios testing:

Demonstrated Empathy - Understanding others' emotional states
Pragmatic EI - Practical application of emotional intelligence
Depth of Insight - Nuanced analysis of emotional situations
Social Dexterity - Navigation of complex social dynamics
Emotional Reasoning - Logical processing of emotional information
Message Tailoring - Adapting communication style appropriately

Scoring Methods

Rubric Score (0-20): Claude 3.7 Sonnet evaluates responses across 18 criteria
ELO Rating: Pairwise comparisons against 47 reference models using TrueSkill algorithm
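
To give a sense of scale for ELO differences, the classic Elo model maps a rating gap to an expected win probability. EQ-Bench 3 derives its ratings with TrueSkill, as noted above, so the logistic formula and 400-point scale used here are assumptions for illustration only and not the benchmark's own computation.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the classic Elo logistic model.

    The 400-point scale is the conventional Elo default, assumed here;
    it is not taken from the EQ-Bench 3 toolkit.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Raw ELO ratings reported for the two Gemini 3 previews.
pro_vs_flash = elo_expected_score(1642.94, 1622.25)
flash_vs_pro = elo_expected_score(1622.25, 1642.94)

print(f"Expected score, Pro vs Flash: {pro_vs_flash:.3f}")
print(f"Expected score, Flash vs Pro: {flash_vs_pro:.3f}")
```

Under these assumptions, a gap of roughly 20 points corresponds to an expected score of about 0.53 for the higher-rated model, which is only a modest edge in a direct matchup.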

Benchmark Configuration

Test Models: google/gemini-3-pro-preview, google/gemini-3-flash-preview
Judge Model: anthropic/claude-3.7-sonnet
Provider: OpenRouter
Pairwise Comparisons: 578 per model
Reference Models: 47

Conclusions

Google's Gemini 3 preview models demonstrate excellent emotional intelligence capabilities:

Pro is the better choice for EQ-sensitive tasks - Its higher ELO reflects stronger judged performance in head-to-head comparisons against the reference models
Flash offers competitive EQ at lower cost - Only ~20 ELO points behind Pro with potentially faster inference
Both outperform most flagship models - Including GPT-5 Chat, Claude Opus 4, and Claude Sonnet 4
Gemini 2.5 Pro Preview remains slightly ahead - It ranks #4, one place above 3 Pro, although both round to an ELO of 1643

References

Research conducted using the EQ-Bench 3 benchmarking toolkit
Data collected December 25, 2025

sergebulaev/gemini3_eqbench_research.md
