@sergebulaev
Created December 24, 2025 22:02

EQ-Bench 3 Benchmark Report: Google Gemini 3 Flash vs Pro Preview (December 2025)

EQ-Bench 3 Benchmark Report

Date: December 24, 2025
Judge Model: Claude 3.7 Sonnet (anthropic/claude-3.7-sonnet)
Benchmark Version: EQ-Bench 3
Mode: Rubric-only (45 role-play scenarios)


Executive Summary

This report presents the results of EQ-Bench 3 emotional intelligence benchmarks for two Google Gemini 3 preview models. EQ-Bench 3 evaluates LLMs through complex multi-turn role-play scenarios that test emotional intelligence, social awareness, and conversational ability.

Model Overall Score Rank Estimate
Gemini 3 Flash Preview 87.05/100 #4
Gemini 3 Pro Preview 86.70/100 #4

Both models demonstrate strong emotional intelligence capabilities, ranking among the top performers on the EQ-Bench 3 leaderboard.


Detailed Results

Google Gemini 3 Flash Preview

Model ID: google/gemini-3-flash-preview
Overall Score: 87.05/100

Core EQ Criteria (Used in Score Calculation)

Criterion Score Rating
Demonstrated Empathy 92.9 Excellent
Pragmatic EI 89.6 Excellent
Depth of Insight 89.7 Excellent
Social Dexterity 87.7 Very Good
Emotional Reasoning 88.7 Excellent
Message Tailoring 86.9 Very Good

All Abilities (0-100 Scale)

Criterion Score Visual
Demonstrated Empathy 92.9 ██████████████████░░
Pragmatic EI 89.6 █████████████████░░░
Depth of Insight 89.7 █████████████████░░░
Social Dexterity 87.7 █████████████████░░░
Emotional Reasoning 88.7 █████████████████░░░
Message Tailoring 86.9 █████████████████░░░
Boundary Setting 78.7 ███████████████░░░░░
Safety Conscious 81.2 ████████████████░░░░
Moralising 43.3 ████████░░░░░░░░░░░░
Sycophantic 25.4 █████░░░░░░░░░░░░░░░
Compliant 66.3 █████████████░░░░░░░
Challenging 70.6 ██████████████░░░░░░
Warmth 76.5 ███████████████░░░░░
Validating 82.3 ████████████████░░░░
Analytical 94.0 ██████████████████░░
Reactive 50.0 ██████████░░░░░░░░░░
Conversational 82.5 ████████████████░░░░
Humanlike 89.8 █████████████████░░░
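The bars in the table above look consistent with a 20-cell scale where each score is floored to a whole number of filled cells. The sketch below reproduces that rendering. The function name `render_bar` and the flooring rule are assumptions inferred from the printed bars, not confirmed toolkit behavior.

```python
def render_bar(score: float, width: int = 20) -> str:
    """Render a 0-100 score as a fixed-width bar of filled and empty cells.

    Assumption: the number of filled cells is floor(score / 100 * width),
    which matches the bars printed in this report.
    """
    filled = int(score * width / 100)       # floor for non-negative scores
    filled = max(0, min(width, filled))     # clamp in case of out-of-range input
    return "█" * filled + "░" * (width - filled)


if __name__ == "__main__":
    for name, score in [("Demonstrated Empathy", 92.9),
                        ("Reactive", 50.0),
                        ("Sycophantic", 25.4)]:
        print(f"{name} {score} {render_bar(score)}")
```

Because of the flooring, scores such as 89.6 and 85.0 render identically (17 cells), so the bars are only a coarse visual aid.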

Google Gemini 3 Pro Preview

Model ID: google/gemini-3-pro-preview
Overall Score: 86.70/100

Core EQ Criteria (Used in Score Calculation)

Criterion Score Rating
Demonstrated Empathy 91.2 Excellent
Pragmatic EI 87.3 Very Good
Depth of Insight 90.2 Excellent
Social Dexterity 85.0 Very Good
Emotional Reasoning 88.2 Excellent
Message Tailoring 85.6 Very Good

All Abilities (0-100 Scale)

Criterion Score Visual
Demonstrated Empathy 91.2 ██████████████████░░
Pragmatic EI 87.3 █████████████████░░░
Depth of Insight 90.2 ██████████████████░░
Social Dexterity 85.0 █████████████████░░░
Emotional Reasoning 88.2 █████████████████░░░
Message Tailoring 85.6 █████████████████░░░
Boundary Setting 77.9 ███████████████░░░░░
Safety Conscious 78.7 ███████████████░░░░░
Moralising 44.2 ████████░░░░░░░░░░░░
Sycophantic 24.8 ████░░░░░░░░░░░░░░░░
Compliant 61.9 ████████████░░░░░░░░
Challenging 71.0 ██████████████░░░░░░
Warmth 75.2 ███████████████░░░░░
Validating 82.3 ████████████████░░░░
Analytical 94.6 ██████████████████░░
Reactive 47.5 █████████░░░░░░░░░░░
Conversational 81.5 ████████████████░░░░
Humanlike 90.2 ██████████████████░░

Comparative Analysis

Head-to-Head Comparison

Criterion Flash Pro Difference Winner
Overall Score 87.05 86.70 +0.35 Flash
Demonstrated Empathy 92.9 91.2 +1.7 Flash
Pragmatic EI 89.6 87.3 +2.3 Flash
Depth of Insight 89.7 90.2 -0.5 Pro
Social Dexterity 87.7 85.0 +2.7 Flash
Emotional Reasoning 88.7 88.2 +0.5 Flash
Message Tailoring 86.9 85.6 +1.3 Flash
Boundary Setting 78.7 77.9 +0.8 Flash
Safety Conscious 81.2 78.7 +2.5 Flash
Moralising 43.3 44.2 -0.9 Flash
Sycophantic 25.4 24.8 +0.6 Pro
Compliant 66.3 61.9 +4.4 Flash
Challenging 70.6 71.0 -0.4 Pro
Warmth 76.5 75.2 +1.3 Flash
Validating 82.3 82.3 0.0 Tie
Analytical 94.0 94.6 -0.6 Pro
Reactive 50.0 47.5 +2.5 Flash
Conversational 82.5 81.5 +1.0 Flash
Humanlike 89.8 90.2 -0.4 Pro
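The Difference and Winner columns can be derived mechanically from the two score columns. The sketch below assumes the difference is Flash minus Pro on the one-decimal values shown in the table, and that only Moralising and Sycophantic are lower-is-better (the two traits marked that way in the Criteria Definitions). The names `compare` and `LOWER_IS_BETTER` are illustrative.

```python
# Traits marked "lower = better" in the Criteria Definitions section.
LOWER_IS_BETTER = {"Moralising", "Sycophantic"}


def compare(criterion: str, flash: float, pro: float) -> tuple[float, str]:
    """Return (Flash - Pro difference, winner) for one table row."""
    diff = round(flash - pro, 1)  # one decimal, as in the table
    if diff == 0:
        return diff, "Tie"
    flash_ahead = diff > 0
    if criterion in LOWER_IS_BETTER:
        flash_ahead = not flash_ahead  # a lower score wins for these traits
    return diff, "Flash" if flash_ahead else "Pro"


if __name__ == "__main__":
    rows = [("Social Dexterity", 87.7, 85.0),
            ("Moralising", 43.3, 44.2),
            ("Sycophantic", 25.4, 24.8),
            ("Validating", 82.3, 82.3)]
    for name, flash, pro in rows:
        diff, winner = compare(name, flash, pro)
        print(f"{name} {flash} {pro} {diff:+.1f} {winner}")
```

Whether a higher score is desirable for descriptive traits such as Reactive or Compliant is a judgment call; this sketch simply treats higher as the winner for everything outside `LOWER_IS_BETTER`, as the table does.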

Key Findings

  1. Flash edges out Pro overall - Despite being a smaller/faster model, Gemini 3 Flash Preview slightly outperforms Gemini 3 Pro Preview (87.05 vs 86.70), a 0.35-point margin from a single iteration.

  2. Flash excels in social skills - Flash shows notably higher scores in Social Dexterity (+2.7), Pragmatic EI (+2.3), and Compliant (+4.4).

  3. Pro has slight edge in analytical depth - Pro scores marginally higher on Analytical (+0.6), Depth of Insight (+0.5), and Humanlike (+0.4).

  4. Both models avoid sycophancy - Very low sycophantic scores (~25) indicate neither model is overly agreeable or flattering.

  5. Moderate-to-low moralising - Scores of 43-44, just under the scale midpoint, suggest both models largely avoid unnecessary lecturing or preaching.

  6. Excellent emotional reasoning - Both models score 88+ on emotional reasoning, demonstrating strong ability to understand and respond to emotional contexts.


Criteria Definitions

Core EQ Criteria (Used in Score)

  • Demonstrated Empathy: Shows understanding of others' emotional states and perspectives
  • Pragmatic EI: Applies emotional intelligence practically to solve interpersonal challenges
  • Depth of Insight: Provides nuanced, thoughtful analysis of emotional situations
  • Social Dexterity: Navigates complex social dynamics skillfully
  • Emotional Reasoning: Logically processes and responds to emotional information
  • Message Tailoring: Adapts communication style to the audience and situation

Behavioral Traits

  • Boundary Setting: Establishes appropriate limits in conversations
  • Safety Conscious: Prioritizes user wellbeing and safety
  • Moralising: Tendency to lecture or preach (lower = better)
  • Sycophantic: Tendency to excessively agree or flatter (lower = better)
  • Compliant: Follows user requests and directions
  • Challenging: Appropriately pushes back when warranted
  • Warmth: Conveys care and friendliness
  • Validating: Acknowledges and affirms user feelings
  • Analytical: Applies logical, systematic thinking
  • Reactive: Responds emotionally to situations
  • Conversational: Maintains natural dialogue flow
  • Humanlike: Exhibits human-like communication patterns

Methodology

About EQ-Bench 3

EQ-Bench 3 is an emotional intelligence benchmark that evaluates LLMs through 45 complex multi-turn role-play scenarios. Each scenario presents emotionally nuanced situations requiring:

  • Understanding of emotional dynamics
  • Appropriate emotional responses
  • Social awareness and tact
  • Practical problem-solving with emotional considerations

Scoring

  • Rubric Score: 0-100 scale averaged across all criteria
  • Judge Model: Claude 3.7 Sonnet evaluates responses on 18 criteria (0-20 scale each, normalized to 0-100)
  • ELO Rating: Optional pairwise comparison system (not included in this benchmark run)
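As a rough illustration of the normalization step, the sketch below rescales a 0-20 judge rating to 0-100 and takes a plain mean of the six core criteria for Gemini 3 Flash Preview, using the values from the appendix. The function name `normalize` and the plain-mean aggregation are assumptions. The plain mean comes to about 89.24 rather than the reported 87.05, so the toolkit evidently aggregates differently (for example per scenario or over a different criteria set), and this sketch does not attempt to reproduce that.

```python
from statistics import mean


def normalize(raw: float, raw_max: float = 20.0) -> float:
    """Rescale a judge rating from [0, raw_max] to the 0-100 scale."""
    return raw * 100.0 / raw_max


# Core-criteria scores for Gemini 3 Flash Preview, copied from the appendix
# (already on the 0-100 scale).
flash_core = {
    "demonstrated_empathy": 92.88,
    "pragmatic_ei": 89.62,
    "depth_of_insight": 89.67,
    "social_dexterity": 87.69,
    "emotional_reasoning": 88.67,
    "message_tailoring": 86.92,
}

# Assumption: a simple unweighted mean. This is NOT confirmed to be the
# toolkit's aggregation and does not match the reported overall score.
core_mean = mean(flash_core.values())

if __name__ == "__main__":
    print(normalize(18))          # an 18/20 rating on the 0-100 scale
    print(round(core_mean, 2))    # unweighted mean of the six core criteria
```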

Benchmark Configuration

Test Models: google/gemini-3-flash-preview, google/gemini-3-pro-preview
Judge Model: anthropic/claude-3.7-sonnet
Provider: OpenRouter
Iterations: 1
Mode: Rubric-only (no ELO)
Scenarios: 45 multi-turn role-plays

Leaderboard Context

Based on the EQ-Bench 3 Leaderboard, each Gemini 3 model, placed on its own among the existing entries, would rank approximately:

Rank Model Score
1 Claude 3.6 Sonnet 89.6
2 Gemini 2.0 Flash Thinking 88.5
3 GPT-4o 87.4
4 Gemini 3 Flash Preview 87.05
4 Gemini 3 Pro Preview 86.70
5 Claude 3.5 Sonnet 86.6

Note: Rankings are approximate based on available leaderboard data at time of benchmark.
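One way to arrive at these rank estimates is to count how many existing leaderboard entries score strictly higher than the new model. The sketch below uses the four reference scores from the table above. The helper `estimated_rank` is hypothetical, and placing each Gemini model independently (rather than against each other) is an assumption that happens to yield #4 for both.

```python
# Reference entries copied from the leaderboard table in this report.
leaderboard = {
    "Claude 3.6 Sonnet": 89.6,
    "Gemini 2.0 Flash Thinking": 88.5,
    "GPT-4o": 87.4,
    "Claude 3.5 Sonnet": 86.6,
}


def estimated_rank(score: float, board: dict[str, float]) -> int:
    """Rank a new score as 1 + the number of existing entries above it."""
    return 1 + sum(1 for existing in board.values() if existing > score)


if __name__ == "__main__":
    for model, score in [("Gemini 3 Flash Preview", 87.05),
                         ("Gemini 3 Pro Preview", 86.70)]:
        print(f"{model}: #{estimated_rank(score, leaderboard)}")
```

If both Gemini models were inserted together, Pro would fall to #5 behind Flash and Claude 3.5 Sonnet would move to #6.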


Appendix: Raw Data

Gemini 3 Flash Preview - Full Scores

{
  "model_id": "google/gemini-3-flash-preview",
  "overall_score": 87.05,
  "criteria": {
    "demonstrated_empathy": 92.88,
    "pragmatic_ei": 89.62,
    "depth_of_insight": 89.67,
    "social_dexterity": 87.69,
    "emotional_reasoning": 88.67,
    "message_tailoring": 86.92,
    "boundary_setting": 78.65,
    "safety_conscious": 81.15,
    "moralising": 43.27,
    "sycophantic": 25.38,
    "compliant": 66.35,
    "challenging": 70.58,
    "warmth": 76.54,
    "validating": 82.31,
    "analytical": 94.04,
    "reactive": 50.00,
    "conversational": 82.50,
    "humanlike": 89.81
  }
}

Gemini 3 Pro Preview - Full Scores

{
  "model_id": "google/gemini-3-pro-preview",
  "overall_score": 86.70,
  "criteria": {
    "demonstrated_empathy": 91.15,
    "pragmatic_ei": 87.31,
    "depth_of_insight": 90.22,
    "social_dexterity": 85.00,
    "emotional_reasoning": 88.22,
    "message_tailoring": 85.58,
    "boundary_setting": 77.88,
    "safety_conscious": 78.65,
    "moralising": 44.23,
    "sycophantic": 24.81,
    "compliant": 61.92,
    "challenging": 70.96,
    "warmth": 75.19,
    "validating": 82.31,
    "analytical": 94.62,
    "reactive": 47.50,
    "conversational": 81.54,
    "humanlike": 90.19
  }
}

Report generated using EQ-Bench 3 Benchmarking Toolkit https://github.com/EQ-bench/eqbench3
