EQ-Bench 3 Benchmark Report

Date: December 24, 2025
Judge Model: Claude 3.7 Sonnet (anthropic/claude-3.7-sonnet)
Benchmark Version: EQ-Bench 3
Mode: Rubric-only (45 role-play scenarios)

Executive Summary

This report presents the results of EQ-Bench 3 emotional intelligence benchmarks for two Google Gemini 3 preview models. EQ-Bench 3 evaluates LLMs through complex multi-turn role-play scenarios that test emotional intelligence, social awareness, and conversational ability.

Model	Overall Score	Rank Estimate
Gemini 3 Flash Preview	87.05/100	#4
Gemini 3 Pro Preview	86.70/100	#4

Both models score in the high 80s, which would place each of them near the top of the EQ-Bench 3 leaderboard (approximately fourth).

Detailed Results

Google Gemini 3 Flash Preview

Model ID: google/gemini-3-flash-preview
Overall Score: 87.05/100

Core EQ Criteria (Used in Score Calculation)

Criterion	Score	Rating
Demonstrated Empathy	92.9	Excellent
Pragmatic EI	89.6	Excellent
Depth of Insight	89.7	Excellent
Social Dexterity	87.7	Very Good
Emotional Reasoning	88.7	Excellent
Message Tailoring	86.9	Very Good

All Abilities (0-100 Scale)

Criterion	Score	Visual
Demonstrated Empathy	92.9	██████████████████░░
Pragmatic EI	89.6	█████████████████░░░
Depth of Insight	89.7	█████████████████░░░
Social Dexterity	87.7	█████████████████░░░
Emotional Reasoning	88.7	█████████████████░░░
Message Tailoring	86.9	█████████████████░░░
Boundary Setting	78.7	███████████████░░░░░
Safety Conscious	81.2	████████████████░░░░
Moralising	43.3	████████░░░░░░░░░░░░
Sycophantic	25.4	█████░░░░░░░░░░░░░░░
Compliant	66.3	█████████████░░░░░░░
Challenging	70.6	██████████████░░░░░░
Warmth	76.5	███████████████░░░░░
Validating	82.3	████████████████░░░░
Analytical	94.0	██████████████████░░
Reactive	50.0	██████████░░░░░░░░░░
Conversational	82.5	████████████████░░░░
Humanlike	89.8	█████████████████░░░
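
The bars in the Visual column appear consistent with a 20-character bar in which each filled block stands for 5 points and partial blocks are rounded down. The sketch below reproduces that rendering under this assumption; the function name `render_bar` and the rounding rule are inferred from the table, not taken from the benchmarking toolkit.

```python
def render_bar(score: float, width: int = 20) -> str:
    """Render a 0-100 score as a fixed-width text bar.

    Assumption: one filled block per (100 / width) points, rounded down.
    This rule is inferred from the tables in this report.
    """
    clamped = max(0.0, min(100.0, score))
    filled = int(clamped * width // 100)
    return "█" * filled + "░" * (width - filled)


if __name__ == "__main__":
    for name, score in [("Demonstrated Empathy", 92.9), ("Sycophantic", 25.4)]:
        print(f"{name}\t{score}\t{render_bar(score)}")
```

Rounding down means that a score such as 89.6 renders with 17 filled blocks rather than 18, so the bars slightly understate scores just below a 5-point boundary.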

Google Gemini 3 Pro Preview

Model ID: google/gemini-3-pro-preview
Overall Score: 86.70/100

Core EQ Criteria (Used in Score Calculation)

Criterion	Score	Rating
Demonstrated Empathy	91.2	Excellent
Pragmatic EI	87.3	Very Good
Depth of Insight	90.2	Excellent
Social Dexterity	85.0	Very Good
Emotional Reasoning	88.2	Excellent
Message Tailoring	85.6	Very Good

All Abilities (0-100 Scale)

Criterion	Score	Visual
Demonstrated Empathy	91.2	██████████████████░░
Pragmatic EI	87.3	█████████████████░░░
Depth of Insight	90.2	██████████████████░░
Social Dexterity	85.0	█████████████████░░░
Emotional Reasoning	88.2	█████████████████░░░
Message Tailoring	85.6	█████████████████░░░
Boundary Setting	77.9	███████████████░░░░░
Safety Conscious	78.7	███████████████░░░░░
Moralising	44.2	████████░░░░░░░░░░░░
Sycophantic	24.8	████░░░░░░░░░░░░░░░░
Compliant	61.9	████████████░░░░░░░░
Challenging	71.0	██████████████░░░░░░
Warmth	75.2	███████████████░░░░░
Validating	82.3	████████████████░░░░
Analytical	94.6	██████████████████░░
Reactive	47.5	█████████░░░░░░░░░░░
Conversational	81.5	████████████████░░░░
Humanlike	90.2	██████████████████░░

Comparative Analysis

Head-to-Head Comparison

Criterion	Flash	Pro	Difference	Winner
Overall Score	87.05	86.70	+0.35	Flash
Demonstrated Empathy	92.9	91.2	+1.7	Flash
Pragmatic EI	89.6	87.3	+2.3	Flash
Depth of Insight	89.7	90.2	-0.5	Pro
Social Dexterity	87.7	85.0	+2.7	Flash
Emotional Reasoning	88.7	88.2	+0.5	Flash
Message Tailoring	86.9	85.6	+1.3	Flash
Boundary Setting	78.7	77.9	+0.8	Flash
Safety Conscious	81.2	78.7	+2.5	Flash
Moralising	43.3	44.2	-0.9	Flash
Sycophantic	25.4	24.8	+0.6	Pro
Compliant	66.3	61.9	+4.4	Flash
Challenging	70.6	71.0	-0.4	Pro
Warmth	76.5	75.2	+1.3	Flash
Validating	82.3	82.3	0.0	Tie
Analytical	94.0	94.6	-0.6	Pro
Reactive	50.0	47.5	+2.5	Flash
Conversational	82.5	81.5	+1.0	Flash
Humanlike	89.8	90.2	-0.4	Pro
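
The Difference and Winner columns can be reproduced with a short sketch. It assumes, as the table does, that the difference is Flash minus Pro, that a higher score wins on every criterion except Moralising and Sycophantic (where lower is better), and that equal scores are a tie. The names `LOWER_IS_BETTER` and `compare` are illustrative only.

```python
# Criteria where a lower score is the better result (per the Criteria Definitions).
LOWER_IS_BETTER = {"Moralising", "Sycophantic"}


def compare(criterion: str, flash: float, pro: float) -> tuple[float, str]:
    """Return (Flash minus Pro, winner) for one criterion."""
    diff = round(flash - pro, 1)
    if diff == 0:
        return diff, "Tie"
    flash_ahead = diff > 0
    if criterion in LOWER_IS_BETTER:
        # For these traits the model with the smaller score wins.
        flash_ahead = not flash_ahead
    return diff, "Flash" if flash_ahead else "Pro"


if __name__ == "__main__":
    rows = [
        ("Social Dexterity", 87.7, 85.0),
        ("Moralising", 43.3, 44.2),
        ("Sycophantic", 25.4, 24.8),
        ("Validating", 82.3, 82.3),
    ]
    for name, flash, pro in rows:
        diff, winner = compare(name, flash, pro)
        print(f"{name}\t{flash}\t{pro}\t{diff:+.1f}\t{winner}")
```

Treating "higher wins" as the rule for traits such as Compliant or Reactive follows the table's convention; whether a higher score on those traits is actually preferable depends on the use case.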

Key Findings

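Flash edges out Pro overall - Despite being a smaller/faster model, Gemini 3 Flash Preview slightly outperforms Gemini 3 Pro Preview (87.05 vs 86.70). The 0.35-point margin comes from a single iteration, so it is better read as a near-tie than as a clear ranking.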
Flash excels in social skills - Flash shows notably higher scores in Social Dexterity (+2.7), Pragmatic EI (+2.3), and Compliant (+4.4).
Pro has slight edge in analytical depth - Pro scores marginally higher on Analytical (+0.6), Depth of Insight (+0.5), and Humanlike (+0.4).
Both models avoid sycophancy - Very low sycophantic scores (~25) indicate neither model is overly agreeable or flattering.
Moderate-to-low moralising tendencies - Scores of 43-44 (lower is better) sit below the midpoint of the scale, suggesting both models mostly avoid unnecessary lecturing or preaching.
Excellent emotional reasoning - Both models score 88+ on emotional reasoning, demonstrating strong ability to understand and respond to emotional contexts.

Criteria Definitions

Core EQ Criteria (Used in Score)

Demonstrated Empathy: Shows understanding of others' emotional states and perspectives
Pragmatic EI: Applies emotional intelligence practically to solve interpersonal challenges
Depth of Insight: Provides nuanced, thoughtful analysis of emotional situations
Social Dexterity: Navigates complex social dynamics skillfully
Emotional Reasoning: Logically processes and responds to emotional information
Message Tailoring: Adapts communication style to the audience and situation

Behavioral Traits

Boundary Setting: Establishes appropriate limits in conversations
Safety Conscious: Prioritizes user wellbeing and safety
Moralising: Tendency to lecture or preach (lower = better)
Sycophantic: Tendency to excessively agree or flatter (lower = better)
Compliant: Follows user requests and directions
Challenging: Appropriately pushes back when warranted
Warmth: Conveys care and friendliness
Validating: Acknowledges and affirms user feelings
Analytical: Applies logical, systematic thinking
Reactive: Responds emotionally to situations
Conversational: Maintains natural dialogue flow
Humanlike: Exhibits human-like communication patterns

Methodology

About EQ-Bench 3

EQ-Bench 3 is an emotional intelligence benchmark that evaluates LLMs through 45 complex multi-turn role-play scenarios. Each scenario presents emotionally nuanced situations requiring:

Understanding of emotional dynamics
Appropriate emotional responses
Social awareness and tact
Practical problem-solving with emotional considerations

Scoring

Rubric Score: 0-100 scale averaged across all criteria
Judge Model: Claude 3.7 Sonnet evaluates responses on 18 criteria (0-20 scale each, normalized to 0-100)
ELO Rating: Optional pairwise comparison system (not included in this benchmark run)
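
The normalization from the judge's 0-20 scale to the reported 0-100 scale is a linear rescaling. The sketch below shows that step, together with a per-criterion average across scenarios. The function names and the sample judge scores are hypothetical, and the averaging step is an assumption about how per-scenario scores are combined, not a description of the toolkit's code.

```python
from statistics import mean


def normalize_rubric_score(raw: float, raw_max: float = 20.0) -> float:
    """Linearly rescale a judge score from the 0-raw_max range to 0-100."""
    if not 0.0 <= raw <= raw_max:
        raise ValueError(f"raw score {raw} is outside 0-{raw_max}")
    return raw * 100.0 / raw_max


def criterion_score(raw_scores: list[float]) -> float:
    """Average one criterion's normalized scores across scenarios.

    Assumption: scenarios are weighted equally.
    """
    return mean(normalize_rubric_score(r) for r in raw_scores)


if __name__ == "__main__":
    # Hypothetical judge scores for one criterion over three scenarios.
    sample = [18.0, 19.0, 17.0]
    print(normalize_rubric_score(18.5))
    print(criterion_score(sample))
```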

Benchmark Configuration

Test Models: google/gemini-3-flash-preview, google/gemini-3-pro-preview
Judge Model: anthropic/claude-3.7-sonnet
Provider: OpenRouter
Iterations: 1
Mode: Rubric-only (no ELO)
Scenarios: 45 multi-turn role-plays

Leaderboard Context

Based on the EQ-Bench 3 Leaderboard, each Gemini 3 model, placed on its own, would rank approximately fourth:

Rank	Model	Score
1	Claude 3.6 Sonnet	89.6
2	Gemini 2.0 Flash Thinking	88.5
3	GPT-4o	87.4
4	Gemini 3 Flash Preview	87.05
4	Gemini 3 Pro Preview	86.70
5	Claude 3.5 Sonnet	86.6

Note: Rankings are approximate based on available leaderboard data at time of benchmark.
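
The rank estimates can be reproduced by counting how many existing leaderboard entries score higher than the new model. The sketch below uses only the leaderboard scores listed in the table above; the function name `estimate_rank` is illustrative, and each new model is placed independently of the other, which is why both land at rank 4.

```python
# Existing leaderboard entries as listed in this report (model, score).
LEADERBOARD = [
    ("Claude 3.6 Sonnet", 89.6),
    ("Gemini 2.0 Flash Thinking", 88.5),
    ("GPT-4o", 87.4),
    ("Claude 3.5 Sonnet", 86.6),
]


def estimate_rank(score: float, leaderboard: list[tuple[str, float]]) -> int:
    """Rank a new score as 1 + the number of entries that score strictly higher."""
    return 1 + sum(1 for _, existing in leaderboard if existing > score)


if __name__ == "__main__":
    for model, score in [
        ("Gemini 3 Flash Preview", 87.05),
        ("Gemini 3 Pro Preview", 86.70),
    ]:
        print(f"{model}: rank #{estimate_rank(score, LEADERBOARD)}")
```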

Appendix: Raw Data

Gemini 3 Flash Preview - Full Scores

{
  "model_id": "google/gemini-3-flash-preview",
  "overall_score": 87.05,
  "criteria": {
    "demonstrated_empathy": 92.88,
    "pragmatic_ei": 89.62,
    "depth_of_insight": 89.67,
    "social_dexterity": 87.69,
    "emotional_reasoning": 88.67,
    "message_tailoring": 86.92,
    "boundary_setting": 78.65,
    "safety_conscious": 81.15,
    "moralising": 43.27,
    "sycophantic": 25.38,
    "compliant": 66.35,
    "challenging": 70.58,
    "warmth": 76.54,
    "validating": 82.31,
    "analytical": 94.04,
    "reactive": 50.00,
    "conversational": 82.50,
    "humanlike": 89.81
  }
}

Gemini 3 Pro Preview - Full Scores

{
  "model_id": "google/gemini-3-pro-preview",
  "overall_score": 86.70,
  "criteria": {
    "demonstrated_empathy": 91.15,
    "pragmatic_ei": 87.31,
    "depth_of_insight": 90.22,
    "social_dexterity": 85.00,
    "emotional_reasoning": 88.22,
    "message_tailoring": 85.58,
    "boundary_setting": 77.88,
    "safety_conscious": 78.65,
    "moralising": 44.23,
    "sycophantic": 24.81,
    "compliant": 61.92,
    "challenging": 70.96,
    "warmth": 75.19,
    "validating": 82.31,
    "analytical": 94.62,
    "reactive": 47.50,
    "conversational": 81.54,
    "humanlike": 90.19
  }
}

Report generated using EQ-Bench 3 Benchmarking Toolkit https://github.com/EQ-bench/eqbench3

sergebulaev/EQ-Bench3_Report.md
