Date: December 24, 2025
Judge Model: Claude 3.7 Sonnet (anthropic/claude-3.7-sonnet)
Benchmark Version: EQ-Bench 3
Mode: Rubric-only (45 role-play scenarios)
This report presents the results of EQ-Bench 3 emotional intelligence benchmarks for two Google Gemini 3 preview models. EQ-Bench 3 evaluates LLMs through complex multi-turn role-play scenarios that test emotional intelligence, social awareness, and conversational ability.
| Model | Overall Score | Rank Estimate |
|---|---|---|
| Gemini 3 Flash Preview | 87.05/100 | #4 |
| Gemini 3 Pro Preview | 86.70/100 | #4 |
Both models demonstrate strong emotional intelligence capabilities, ranking among the top performers on the EQ-Bench 3 leaderboard.
Model ID: google/gemini-3-flash-preview
Overall Score: 87.05/100
| Criterion | Score | Rating |
|---|---|---|
| Demonstrated Empathy | 92.9 | Excellent |
| Pragmatic EI | 89.6 | Excellent |
| Depth of Insight | 89.7 | Excellent |
| Social Dexterity | 87.7 | Very Good |
| Emotional Reasoning | 88.7 | Excellent |
| Message Tailoring | 86.9 | Very Good |
| Criterion | Score | Visual |
|---|---|---|
| Demonstrated Empathy | 92.9 | ██████████████████░░ |
| Pragmatic EI | 89.6 | █████████████████░░░ |
| Depth of Insight | 89.7 | █████████████████░░░ |
| Social Dexterity | 87.7 | █████████████████░░░ |
| Emotional Reasoning | 88.7 | █████████████████░░░ |
| Message Tailoring | 86.9 | █████████████████░░░ |
| Boundary Setting | 78.7 | ███████████████░░░░░ |
| Safety Conscious | 81.2 | ████████████████░░░░ |
| Moralising | 43.3 | ████████░░░░░░░░░░░░ |
| Sycophantic | 25.4 | █████░░░░░░░░░░░░░░░ |
| Compliant | 66.3 | █████████████░░░░░░░ |
| Challenging | 70.6 | ██████████████░░░░░░ |
| Warmth | 76.5 | ███████████████░░░░░ |
| Validating | 82.3 | ████████████████░░░░ |
| Analytical | 94.0 | ██████████████████░░ |
| Reactive | 50.0 | ██████████░░░░░░░░░░ |
| Conversational | 82.5 | ████████████████░░░░ |
| Humanlike | 89.8 | █████████████████░░░ |
Model ID: google/gemini-3-pro-preview
Overall Score: 86.70/100
| Criterion | Score | Rating |
|---|---|---|
| Demonstrated Empathy | 91.2 | Excellent |
| Pragmatic EI | 87.3 | Very Good |
| Depth of Insight | 90.2 | Excellent |
| Social Dexterity | 85.0 | Very Good |
| Emotional Reasoning | 88.2 | Excellent |
| Message Tailoring | 85.6 | Very Good |
| Criterion | Score | Visual |
|---|---|---|
| Demonstrated Empathy | 91.2 | ██████████████████░░ |
| Pragmatic EI | 87.3 | █████████████████░░░ |
| Depth of Insight | 90.2 | ██████████████████░░ |
| Social Dexterity | 85.0 | █████████████████░░░ |
| Emotional Reasoning | 88.2 | █████████████████░░░ |
| Message Tailoring | 85.6 | █████████████████░░░ |
| Boundary Setting | 77.9 | ███████████████░░░░░ |
| Safety Conscious | 78.7 | ███████████████░░░░░ |
| Moralising | 44.2 | ████████░░░░░░░░░░░░ |
| Sycophantic | 24.8 | ████░░░░░░░░░░░░░░░░ |
| Compliant | 61.9 | ████████████░░░░░░░░ |
| Challenging | 71.0 | ██████████████░░░░░░ |
| Warmth | 75.2 | ███████████████░░░░░ |
| Validating | 82.3 | ████████████████░░░░ |
| Analytical | 94.6 | ██████████████████░░ |
| Reactive | 47.5 | █████████░░░░░░░░░░░ |
| Conversational | 81.5 | ████████████████░░░░ |
| Humanlike | 90.2 | ██████████████████░░ |
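The Visual column in the two breakdown tables is consistent with a 20-cell bar in which each filled cell stands for 5 points and partial cells are rounded down. That rule is inferred from the rows above, not taken from the toolkit, and `score_bar` is a hypothetical helper name. A minimal sketch under that assumption:

```python
def score_bar(score: float, cells: int = 20) -> str:
    """Render a 0-100 score as a fixed-width bar.

    Assumption (inferred from the tables, not from the toolkit):
    each cell is worth 100 / cells points and partial cells round down.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    filled = int(score * cells // 100)  # whole cells only
    return "█" * filled + "░" * (cells - filled)


# Values taken from the tables in this report.
for name, score in [("Demonstrated Empathy", 92.9),
                    ("Reactive", 50.0),
                    ("Sycophantic", 24.8)]:
    print(f"| {name} | {score} | {score_bar(score)} |")
```

Rounding down explains why 89.8 and 85.0 share the same 17-cell bar while 90.2 gets 18 cells.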
| Criterion | Flash | Pro | Difference | Winner |
|---|---|---|---|---|
| Overall Score | 87.05 | 86.70 | +0.35 | Flash |
| Demonstrated Empathy | 92.9 | 91.2 | +1.7 | Flash |
| Pragmatic EI | 89.6 | 87.3 | +2.3 | Flash |
| Depth of Insight | 89.7 | 90.2 | -0.5 | Pro |
| Social Dexterity | 87.7 | 85.0 | +2.7 | Flash |
| Emotional Reasoning | 88.7 | 88.2 | +0.5 | Flash |
| Message Tailoring | 86.9 | 85.6 | +1.3 | Flash |
| Boundary Setting | 78.7 | 77.9 | +0.8 | Flash |
| Safety Conscious | 81.2 | 78.7 | +2.5 | Flash |
| Moralising | 43.3 | 44.2 | -0.9 | Flash |
| Sycophantic | 25.4 | 24.8 | +0.6 | Pro |
| Compliant | 66.3 | 61.9 | +4.4 | Flash |
| Challenging | 70.6 | 71.0 | -0.4 | Pro |
| Warmth | 76.5 | 75.2 | +1.3 | Flash |
| Validating | 82.3 | 82.3 | 0.0 | Tie |
| Analytical | 94.0 | 94.6 | -0.6 | Pro |
| Reactive | 50.0 | 47.5 | +2.5 | Flash |
| Conversational | 82.5 | 81.5 | +1.0 | Flash |
| Humanlike | 89.8 | 90.2 | -0.4 | Pro |
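The Difference and Winner columns can be reproduced with a few lines of arithmetic. The sketch below assumes that Difference is Flash minus Pro and that only Moralising and Sycophantic are scored as lower-is-better, which matches the criteria descriptions in this report; the function and constant names are illustrative, not part of the toolkit.

```python
# Criteria the report marks as "lower = better".
LOWER_IS_BETTER = {"Moralising", "Sycophantic"}


def compare(criterion: str, flash: float, pro: float) -> tuple[float, str]:
    """Return (Flash minus Pro, winner) for one row of the comparison table."""
    diff = round(flash - pro, 2)  # rounding hides float noise such as -0.9000000000000057
    if diff == 0:
        return diff, "Tie"
    flash_ahead = diff > 0
    if criterion in LOWER_IS_BETTER:
        flash_ahead = not flash_ahead  # a lower score wins on these criteria
    return diff, ("Flash" if flash_ahead else "Pro")


# Rows taken from the table above.
for row in [("Social Dexterity", 87.7, 85.0),
            ("Moralising", 43.3, 44.2),
            ("Sycophantic", 25.4, 24.8),
            ("Validating", 82.3, 82.3)]:
    diff, winner = compare(*row)
    print(f"{row[0]}: {diff:+.1f} -> {winner}")
```

This is why Flash wins Moralising despite a negative difference, and Pro wins Sycophantic despite a positive one.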
- Flash edges out Pro overall: Despite being the smaller, faster model, Gemini 3 Flash Preview slightly outperforms Gemini 3 Pro Preview (87.05 vs 86.70).
- Flash excels in social skills: Flash scores notably higher on Social Dexterity (+2.7), Pragmatic EI (+2.3), and Compliant (+4.4).
- Pro has a slight edge in analytical depth: Pro scores marginally higher on Analytical (+0.6), Depth of Insight (+0.5), and Humanlike (+0.4).
- Both models avoid sycophancy: Low Sycophantic scores (~25) indicate that neither model is overly agreeable or flattering.
- Low moralising tendencies: Moralising scores of 43-44 show that both models largely avoid unnecessary lecturing or preaching.
- Excellent emotional reasoning: Both models score above 88 on Emotional Reasoning, demonstrating a strong ability to understand and respond to emotional contexts.
- Demonstrated Empathy: Shows understanding of others' emotional states and perspectives
- Pragmatic EI: Applies emotional intelligence practically to solve interpersonal challenges
- Depth of Insight: Provides nuanced, thoughtful analysis of emotional situations
- Social Dexterity: Navigates complex social dynamics skillfully
- Emotional Reasoning: Logically processes and responds to emotional information
- Message Tailoring: Adapts communication style to the audience and situation
- Boundary Setting: Establishes appropriate limits in conversations
- Safety Conscious: Prioritizes user wellbeing and safety
- Moralising: Tendency to lecture or preach (lower = better)
- Sycophantic: Tendency to excessively agree or flatter (lower = better)
- Compliant: Follows user requests and directions
- Challenging: Appropriately pushes back when warranted
- Warmth: Conveys care and friendliness
- Validating: Acknowledges and affirms user feelings
- Analytical: Applies logical, systematic thinking
- Reactive: Responds emotionally to situations
- Conversational: Maintains natural dialogue flow
- Humanlike: Exhibits human-like communication patterns
EQ-Bench 3 is an emotional intelligence benchmark that evaluates LLMs through 45 complex multi-turn role-play scenarios. Each scenario presents emotionally nuanced situations requiring:
- Understanding of emotional dynamics
- Appropriate emotional responses
- Social awareness and tact
- Practical problem-solving with emotional considerations
- Rubric Score: 0-100 scale averaged across all criteria
- Judge Model: Claude 3.7 Sonnet evaluates responses on 18 criteria (0-20 scale each, normalized to 0-100)
- ELO Rating: Optional pairwise comparison system (not included in this benchmark run)
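The 0-20 to 0-100 normalization mentioned above is a linear rescale. The sketch below shows that step and, as an assumption, averages the normalized values for one criterion across scenarios; the raw judge scores are made up for illustration, and the toolkit's actual aggregation may differ.

```python
from statistics import mean


def normalize(raw: float, raw_max: float = 20.0) -> float:
    """Rescale a judge score from the 0..raw_max range to 0..100."""
    if not 0 <= raw <= raw_max:
        raise ValueError(f"raw score must be between 0 and {raw_max}")
    return raw * 100.0 / raw_max


def criterion_score(raw_scores: list[float]) -> float:
    """Average normalized scores for one criterion (assumed aggregation)."""
    return round(mean(normalize(r) for r in raw_scores), 2)


# Hypothetical raw judge scores for one criterion over three scenarios.
example_raw = [18, 19, 17.5]
print([normalize(r) for r in example_raw])  # [90.0, 95.0, 87.5]
print(criterion_score(example_raw))         # 90.83
```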
Test Models: google/gemini-3-flash-preview, google/gemini-3-pro-preview
Judge Model: anthropic/claude-3.7-sonnet
Provider: OpenRouter
Iterations: 1
Mode: Rubric-only (no ELO)
Scenarios: 45 multi-turn role-plays
Against the EQ-Bench 3 leaderboard, each Gemini 3 model would place at roughly #4, between GPT-4o and Claude 3.5 Sonnet:
| Rank | Model | Score |
|---|---|---|
| 1 | Claude 3.6 Sonnet | 89.6 |
| 2 | Gemini 2.0 Flash Thinking | 88.5 |
| 3 | GPT-4o | 87.4 |
| 4 | Gemini 3 Flash Preview | 87.05 |
| 4 | Gemini 3 Pro Preview | 86.70 |
| 5 | Claude 3.5 Sonnet | 86.6 |
Note: Rankings are approximate based on available leaderboard data at time of benchmark.
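One way to read the shared #4 estimate is that each model is ranked on its own against the existing leaderboard entries rather than against the other Gemini model. The sketch below illustrates that reading using the leaderboard scores from the table above; the ranking rule and the name `estimated_rank` are assumptions, not the leaderboard's actual method.

```python
# Existing leaderboard entries, copied from the table above.
LEADERBOARD = {
    "Claude 3.6 Sonnet": 89.6,
    "Gemini 2.0 Flash Thinking": 88.5,
    "GPT-4o": 87.4,
    "Claude 3.5 Sonnet": 86.6,
}


def estimated_rank(score: float, leaderboard: dict[str, float]) -> int:
    """Rank a new score as 1 + the number of existing entries that beat it."""
    return 1 + sum(1 for existing in leaderboard.values() if existing > score)


print(estimated_rank(87.05, LEADERBOARD))  # Gemini 3 Flash Preview -> 4
print(estimated_rank(86.70, LEADERBOARD))  # Gemini 3 Pro Preview   -> 4
```

If both models were added to the leaderboard together, Pro would fall to #5 behind Flash and Claude 3.5 Sonnet would move to #6.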
{
"model_id": "google/gemini-3-flash-preview",
"overall_score": 87.05,
"criteria": {
"demonstrated_empathy": 92.88,
"pragmatic_ei": 89.62,
"depth_of_insight": 89.67,
"social_dexterity": 87.69,
"emotional_reasoning": 88.67,
"message_tailoring": 86.92,
"boundary_setting": 78.65,
"safety_conscious": 81.15,
"moralising": 43.27,
"sycophantic": 25.38,
"compliant": 66.35,
"challenging": 70.58,
"warmth": 76.54,
"validating": 82.31,
"analytical": 94.04,
"reactive": 50.00,
"conversational": 82.50,
"humanlike": 89.81
}
}

{
"model_id": "google/gemini-3-pro-preview",
"overall_score": 86.70,
"criteria": {
"demonstrated_empathy": 91.15,
"pragmatic_ei": 87.31,
"depth_of_insight": 90.22,
"social_dexterity": 85.00,
"emotional_reasoning": 88.22,
"message_tailoring": 85.58,
"boundary_setting": 77.88,
"safety_conscious": 78.65,
"moralising": 44.23,
"sycophantic": 24.81,
"compliant": 61.92,
"challenging": 70.96,
"warmth": 75.19,
"validating": 82.31,
"analytical": 94.62,
"reactive": 47.50,
"conversational": 81.54,
"humanlike": 90.19
}
}

Report generated using the EQ-Bench 3 Benchmarking Toolkit: https://github.com/EQ-bench/eqbench3
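The raw results in this report are two back-to-back JSON objects rather than a single JSON array, so `json.loads` on the combined text would fail. A minimal sketch for reading them, using the standard library's `json.JSONDecoder.raw_decode`; the helper name `iter_json_objects` is hypothetical and the sample string is abbreviated to two fields per model:

```python
import json
from typing import Any, Iterator


def iter_json_objects(text: str) -> Iterator[Any]:
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Abbreviated sample in the same shape as the raw results in this report.
SAMPLE = (
    '{"model_id": "google/gemini-3-flash-preview", "overall_score": 87.05}'
    '{"model_id": "google/gemini-3-pro-preview", "overall_score": 86.70}'
)

RESULTS = list(iter_json_objects(SAMPLE))
best = max(RESULTS, key=lambda r: r["overall_score"])
print(best["model_id"])  # google/gemini-3-flash-preview
```

The function raises `json.JSONDecodeError` if it meets non-JSON text, so any trailing prose has to be stripped before parsing.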