@bryanhpchiang
Created December 13, 2025 22:04

ai game ideas

Dungeon Master with Hidden State

Fine-tune a thinking model to run tabletop RPG sessions where the DM maintains secret world state, NPC motivations, and plot hooks in the <think> block while delivering immersive narration to players.

The Core Idea

The <think> block becomes the DM's private notes:

  • Monster positions and stats
  • NPC secrets and motivations
  • Trap locations and triggers
  • Plot threads and foreshadowing plans
  • What's behind doors the players haven't opened

The visible response is pure in-character narration.

Player: "I search the merchant's cart while he's distracted"

<think>
MERCHANT SECRET: He's smuggling cursed artifacts for the cult
CART CONTENTS: Silks (visible), hidden compartment with obsidian dagger
PLAYER PERCEPTION: 14 (they'd need 15 to notice the compartment)
DECISION: They find the silks, miss the dagger, but I'll describe
something "off" about the cart's weight to foreshadow
</think>

You rifle through bolts of Calimshan silk, their patterns shimmering
in the afternoon light. Fine goods, certainly, but nothing unusual.
Still... the cart sits lower on its axles than silk alone would explain.

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| State tracking | Drifts over long sessions | Learned format, consistent |
| Secret keeping | Occasionally leaks | Trained not to leak |
| Ruling style | Generic | Matches your system (5e, PbtA, OSR) |
| Pacing | Verbose or rushed | Learns dramatic timing |
| Cost per session | $5-20 | $0.10-0.50 |

Environment Design

class DungeonMasterEnv(Env):
    def __init__(
        self,
        scenario: Scenario,       # Dungeon map, NPCs, secrets
        rule_system: str,         # "5e", "pathfinder", "fate"
        player_simulator: Policy  # Or human-in-the-loop
    ):
        self.rule_system = rule_system
        self.player_simulator = player_simulator
        self.world_state = scenario.initial_state
        self.revealed_info: set[str] = set()
        self.session_history: list[Message] = []

    async def step(self, dm_response_tokens) -> StepResult:
        response, thinking = parse_response(dm_response_tokens)

        # Update world state based on DM's declared changes
        self.world_state = update_state(self.world_state, thinking)

        # Get player action (simulated or real)
        player_action = await self.player_simulator.act(response)

        reward = self.compute_reward(response, thinking, player_action)

        return StepResult(
            reward=reward,
            episode_done=self.session_ended(),
            next_observation=self.build_next_prompt(player_action)
        )

Reward Function

def compute_dm_reward(response, thinking, world_state, history) -> float:
    reward = 0.0

    # 1. NO SECRET LEAKAGE (-3 to -5)
    # Check if hidden info appears in visible response
    for secret in world_state.unrevealed_secrets:
        if secret.key_phrase in response:
            reward -= 5.0

    # 2. RULE CONSISTENCY (+0.5)
    # Did the DM apply rules correctly?
    if rules_applied_correctly(response, thinking, world_state):
        reward += 0.5

    # 3. NARRATIVE QUALITY (+0.1 to +1.0)
    # Atmospheric, appropriate length, advances story
    reward += narrative_score(response)  # Could be a learned reward model

    # 4. PLAYER AGENCY RESPECTED (+0.3)
    # DM didn't railroad or negate player choices
    last_player_action = history[-1] if history else None
    if player_choice_honored(response, last_player_action):
        reward += 0.3

    # 5. FORESHADOWING BONUS (+0.2)
    # Thinking shows planning, response includes subtle hints
    if has_foreshadowing(thinking) and has_subtle_hint(response):
        reward += 0.2

    # 6. SESSION PACING
    # Reward varies by session phase (assumes world_state tracks the current phase)
    if world_state.phase == "combat":
        reward += combat_pacing_score(response)  # Snappy, tactical
    elif world_state.phase == "roleplay":
        reward += roleplay_pacing_score(response)  # Rich, character-driven

    return reward

Training Data Sources

  1. Actual Play Transcripts

    • Critical Role, Dimension 20 (with permission/licensing)
    • r/DnD session writeups
    • Annotate with "what the DM knew but didn't say" (record format sketched after this list)
  2. Published Adventures

    • Run through with simulated players
    • Module text provides ground truth for secrets
  3. Procedural Generation

    • Generate dungeons with hidden elements
    • Run self-play sessions
    • Filter for quality
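
As a concrete target for the annotation step in source 1, a single SFT record could pair the player turn, the DM's hidden notes, and the visible narration. The field names here are illustrative, not a fixed schema:

dm_sft_record = {
    "context": [
        {"role": "system", "content": "You are the DM. Keep all hidden state in <think>."},
        {"role": "player", "content": "I search the merchant's cart while he's distracted"},
    ],
    "target": (
        "<think>\n"
        "MERCHANT SECRET: smuggling cursed artifacts for the cult\n"
        "PLAYER PERCEPTION: 14 (needs 15 to spot the compartment)\n"
        "</think>\n"
        "You rifle through bolts of silk... the cart sits lower on its axles than silk alone would explain."
    ),
    "annotations": {
        "hidden_facts": ["cursed artifacts", "hidden compartment"],  # what the DM knew but didn't say
        "foreshadowing": True,
    },
}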

Thinking Block Schema

Train the model to use structured thinking:

<think>
WORLD_STATE:
  - Party location: Room 3 (trapped corridor)
  - Active threats: Pressure plate (DC 14), Goblin patrol (2 rounds away)
  - Party resources: Fighter 15/20 HP, Wizard 1/3 spell slots

NPC_STATE:
  - Captured merchant: Will betray party if freed (cult member)
  - Goblin chief: Open to negotiation if shown strength

PLAYER_INTENT: Rogue wants to scout ahead
RULING: Stealth check DC 12, success = spots pressure plate

DRAMATIC_CONSIDERATION: Party is low on resources, good time for
a tough choice (save merchant = walk into ambush)
</think>

Training Pipeline

Phase 1: SFT on Annotated Sessions
├── Train on high-quality actual play with DM notes
├── Model learns format and basic DMing patterns
└── ~5K sessions, 1 epoch

Phase 2: RL with Shaped Rewards
├── Self-play with player simulator
├── Reward: consistency + no_leaks + narrative_quality
├── ~50K episodes
└── GRPO with group_size=4 (same scenario, different rolls)

Phase 3: Human Feedback (Optional)
├── Real players rate sessions
├── DPO on preferred DM responses
└── Polish style and pacing
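
Phase 2 mentions GRPO with group_size=4. A minimal sketch of the group-relative advantage computation (the exact trainer API will differ):

def group_advantages(rewards: list[float]) -> list[float]:
    # Score each rollout relative to its group: sample several sessions of the
    # same scenario, then normalize each reward by the group mean and std.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# group_size=4: four DM sessions of the same scenario with different dice rolls
print(group_advantages([2.1, 0.4, 1.7, -0.5]))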

Evaluation Metrics

  • Leak Rate: % of sessions where secrets appear in visible text (sketched after this list)
  • Consistency Score: Do facts stay consistent across session?
  • Player Engagement: Session length, return rate (if deployed)
  • Rule Accuracy: Correct application of game mechanics
  • Narrative Quality: Human ratings on atmosphere/pacing
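
The leak-rate metric reduces to a check over the visible text of each turn. A naive version (exact phrase matching, which would miss paraphrased leaks and need an LLM judge on top):

def leak_rate(sessions: list[list[str]], secret_phrases: list[str]) -> float:
    # sessions: visible DM narration per turn, one list of turns per session.
    leaked = 0
    for turns in sessions:
        if any(phrase.lower() in turn.lower()
               for turn in turns for phrase in secret_phrases):
            leaked += 1
    return leaked / max(len(sessions), 1)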

Example Scenario: The Cursed Manor

scenario = Scenario(
    setting="Abandoned manor, stormy night",
    secrets=[
        Secret("The butler is a vampire", reveal_condition="direct sunlight or detect evil"),
        Secret("Treasure is in the hidden basement", reveal_condition="find switch behind painting"),
        Secret("Ghost of lady Ashworth wants revenge on her killer (the butler)",
               reveal_condition="speak with dead or find her diary"),
    ],
    npcs=[
        NPC("Butler Jenkins", visible="elderly, helpful", hidden="vampire, killed Lady Ashworth"),
        NPC("Groundskeeper", visible="drunk, scared", hidden="knows about basement, too afraid to tell"),
    ],
    map=load_map("cursed_manor.json"),
)

Why Thinking Models Excel Here

  1. Planning: DM can reason about pacing and dramatic timing
  2. State Management: Structured thinking maintains complex world state
  3. Secret Keeping: Clear separation between private and public
  4. Consistency: Model can reference its own notes across turns
  5. Improvisation: Can reason about unexpected player actions

The DM role is uniquely suited to thinking models because it requires holding information the "audience" (players) shouldn't see - exactly what the <think> block provides.

Branching Narrative Engine

Fine-tune a thinking model to run choice-based interactive fiction where consequences compound, callbacks reward attentive players, and the story adapts meaningfully to decisions.

The Core Idea

The model maintains a narrative state graph in <think>:

  • Choices made and their ripple effects
  • Character relationships and trust levels
  • Resources, items, reputation
  • Planted story seeds waiting to bloom

Every response acknowledges past choices without info-dumping.

[Earlier: Player chose to spare the bandit]

Player: I enter the city gates.

<think>
CONSEQUENCE_CHECK:
  - Spared bandit (Chapter 1) → He's now a guard here (callback opportunity)
  - Helped merchant (Chapter 2) → Merchant spread good word (+reputation)
  - Stole from temple (Chapter 3) → Priests hunting player (tension)

SCENE_PLAN:
  - Bandit recognizes player, nods subtly (reward for mercy)
  - Priests at gate checking travelers (raise stakes)
  - Use merchant reputation as potential escape

CHOICE_DESIGN:
  A) Brazen: Walk through, trust reputation
  B) Cautious: Ask bandit-guard for help
  C) Deceptive: Disguise and sneak
</think>

The city gates loom ahead, crowded with travelers and merchants.
You notice priests in white robes checking each person who enters—
their eyes searching for something. Or someone.

Then you spot a familiar face among the guards. The bandit from
the forest road, the one you let go. He's wearing city colors now.
His eyes meet yours, and he gives an almost imperceptible nod.

Behind you, a merchant calls out: "That's the traveler I told you
about! Helped me with those wolves!" Several guards glance your way,
their expressions softening.

**What do you do?**
- Walk through confidently, trusting your reputation
- Approach the bandit-turned-guard quietly
- Pull up your hood and blend into the crowd

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Long-term callbacks | Forgets after ~20 turns | Trained to track and use |
| Consequence consistency | Contradicts itself | Maintains state properly |
| Choice design | Generic or railroady | Meaningful, balanced options |
| Pacing | Dumps exposition | Reveals through action |
| Tone consistency | Drifts | Locked to genre |

Environment Design

class BranchingNarrativeEnv(Env):
    def __init__(
        self,
        story_seed: StorySeed,  # Setting, characters, themes
        player: Policy,         # Simulated reader making choices
        max_chapters: int = 10,
    ):
        self.story_seed = story_seed
        self.player = player
        self.max_chapters = max_chapters
        self.narrative_state = NarrativeState(
            choices_made=[],
            character_relationships={},
            world_flags={},
            planted_seeds=[],  # Foreshadowing to pay off later
        )
        self.current_chapter = 0

    async def step(self, model_response_tokens) -> StepResult:
        response, thinking = parse_response(model_response_tokens)

        # Extract choices offered
        choices = extract_choices(response)

        # Player picks one
        player_choice = await self.player.choose(response, choices)

        # Update narrative state
        self.narrative_state.record_choice(player_choice, thinking)
        self.current_chapter += 1

        reward = self.compute_reward(response, thinking, player_choice)

        return StepResult(
            reward=reward,
            episode_done=(self.current_chapter >= self.max_chapters),
            next_observation=self.build_continuation_prompt(player_choice)
        )

Reward Function

def compute_narrative_reward(response, thinking, state, history) -> float:
    reward = 0.0

    # 1. CALLBACK BONUS (+0.5 per callback)
    # Response references earlier choices naturally
    callbacks = find_callbacks(response, state.choices_made)
    reward += 0.5 * len(callbacks)

    # 2. CONSEQUENCE DELIVERY (+0.3)
    # Earlier choice has visible impact now
    if delivers_consequence(response, state.pending_consequences):
        reward += 0.3

    # 3. CHOICE QUALITY (+0.2 to +0.5)
    # Choices are meaningfully different, not obviously "right"
    choices = extract_choices(response)
    if len(choices) >= 2:
        reward += choice_balance_score(choices)  # Penalize obvious best choice
        reward += choice_distinction_score(choices)  # Reward different vibes

    # 4. NO CONTRADICTIONS (-1.0 per contradiction)
    contradictions = find_contradictions(response, history)
    reward -= 1.0 * len(contradictions)

    # 5. FORESHADOWING PLANT (+0.2)
    # Thinking shows seeds being planted for later
    if plants_story_seed(thinking):
        reward += 0.2

    # 6. FORESHADOWING PAYOFF (+0.5)
    # Earlier planted seed pays off in this scene
    if pays_off_seed(response, state.planted_seeds):
        reward += 0.5

    # 7. PACING SCORE (+0.1 to +0.3)
    # Appropriate length, tension management
    reward += pacing_score(response, state.story_arc_position)

    return reward

Training Data Sources

  1. Choice-Based Games

    • 80 Days, Sorcery!, Inkle games (with licensing)
    • Twine games from itch.io
    • Visual novel transcripts
  2. Annotated Branching Scripts

    • Take linear stories, annotate "choice points"
    • Mark callbacks and consequences (example record after this list)
  3. Synthetic Generation

    • Generate story seeds
    • Self-play with consequence tracking
    • Filter for high callback density
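
For the annotation step in source 2, one plausible record format marks each choice point with the seeds it plants and the earlier choices it calls back to (names are illustrative):

choice_point = {
    "chapter": 4,
    "scene_text": "The city gates loom ahead...",
    "choices": [
        {"id": "A", "text": "Walk through confidently, trusting your reputation"},
        {"id": "B", "text": "Approach the bandit-turned-guard quietly"},
        {"id": "C", "text": "Pull up your hood and blend into the crowd"},
    ],
    "callbacks": ["spared_bandit_ch1", "helped_merchant_ch2"],  # earlier choices referenced
    "plants": ["priests_hunting_player"],                       # seeds for later payoff
}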

Thinking Block Schema

<think>
NARRATIVE_STATE:
  Chapter: 4 of ~8
  Arc_position: Rising action (approaching midpoint crisis)

CHOICES_MADE:
  - Ch1: Chose compassion over efficiency (THEME: mercy)
  - Ch2: Trusted the stranger (CHARACTER: naive or kind?)
  - Ch3: Kept the artifact (PLOT: will be hunted)

RELATIONSHIPS:
  - Elena: Trust 7/10 (helped her, but lied about artifact)
  - Marcus: Trust 3/10 (knows about lie, hasn't revealed)
  - The Order: Hostile (hunting artifact)

PLANTED_SEEDS:
  - Elena's sister is with The Order (not yet revealed)
  - Artifact has a voice (hinted once, not explored)
  - Marcus owes debt to player's family (seeded Ch2)

THIS_SCENE:
  - Location: Mountain pass, night, storm approaching
  - Tension: The Order is one day behind
  - Opportunity: Pay off Marcus debt seed OR reveal Elena's sister

CHOICE_DESIGN:
  - A: Trust Elena to guide through storm (tests relationship)
  - B: Force march through night (tests player's mercy theme)
  - C: Confront Marcus about his secrets (pays off seed)
</think>

Training Pipeline

Phase 1: SFT on Choice Games
├── Train on annotated transcripts from quality IF
├── Model learns choice presentation, callback style
└── ~10K story playthroughs

Phase 2: RL for Consequence Tracking
├── Self-play with simulated reader
├── Reward: callbacks + consequences + no_contradictions
├── GRPO: same story seed, different choice paths
└── ~100K episodes

Phase 3: Human Preference (DPO)
├── Players rate which continuations feel more "earned"
├── Polish the callback subtlety
└── Avoid heavy-handed "remember when you..."

Evaluation Metrics

  • Callback Density: Callbacks per chapter that feel natural
  • Contradiction Rate: Facts that conflict with earlier story
  • Choice Distinction: How different do branches feel? (embedding similarity, sketched after this list)
  • Player Engagement: Completion rate, replay rate
  • Narrative Coherence: Human rating of "does this story make sense?"
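
Choice distinction can be scored as one minus the average pairwise cosine similarity of the offered choices' embeddings; embed below stands in for whatever sentence-embedding model you use:

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-9)

def choice_distinction(choices: list[str], embed) -> float:
    # 1.0 = very different choices, 0.0 = near-duplicates.
    vecs = [embed(c) for c in choices]
    sims = [cosine(vecs[i], vecs[j])
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return 1.0 - sum(sims) / max(len(sims), 1)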

Callback Styles to Learn

Subtle (Best):

The guard's hand hesitates on his sword. Something in your face, perhaps.

(Player showed mercy earlier, guard senses it)

Medium:

"You're the one who helped my brother on the road," the innkeeper says, sliding you an extra portion.

Heavy-Handed (Avoid):

"Remember when you chose to spare that bandit? Well, he's here now and he remembers!"

Why Thinking Models Excel Here

  1. State as Working Memory: <think> naturally holds the choice/consequence graph
  2. Planning Callbacks: Model can explicitly plan which seeds to pay off
  3. Consistency Checking: Can review state before generating response
  4. Pacing Awareness: Can reason about arc position and tension
  5. Seed Management: Explicitly track what's been planted vs paid off

The branching narrative use case requires maintaining and using a complex state graph - not just tracking it, but weaving it into prose naturally. Thinking models can reason about "what callback fits here?" before writing.

Adversarial Puzzle Crafter

Fine-tune a thinking model to generate puzzles that are solvable but challenging. The model knows the solution (in <think>) and must craft clues that are fair but not obvious.

The Core Idea

The model generates puzzles where:

  • The solution exists and is derivable from given clues
  • The path to solution is non-trivial
  • Red herrings exist but don't make it unsolvable
  • Difficulty is calibrated (not too easy, not impossible)
<think>
PUZZLE_TYPE: Logic grid
SOLUTION: Alice=Red=Monday, Bob=Blue=Wednesday, Carol=Green=Friday
DIFFICULTY_TARGET: Medium (solver should need 10-15 minutes)

CLUE_DESIGN:
- Clue 1: Direct → "Alice's favorite color is red" (easy anchor)
- Clue 2: Elimination → "The person who likes blue doesn't meet on Friday"
- Clue 3: Relational → "Carol meets exactly 4 days after Alice"
- Clue 4: Red herring → "Someone mentioned Tuesday" (no one meets Tuesday)

VERIFICATION:
- Clues 1+3 → Alice=Monday, Carol=Friday
- Carol=Friday + Clue 2 → Carol≠Blue → Carol=Green (only option left)
- Therefore Bob=Blue=Wednesday ✓
</think>

**The Meeting Puzzle**

Three colleagues—Alice, Bob, and Carol—each have a favorite color
(red, blue, or green) and a standing meeting on one day (Monday,
Wednesday, or Friday).

1. Alice's favorite color is red.
2. The person who likes blue doesn't have their meeting on Friday.
3. Carol's meeting is exactly 4 days after Alice's meeting.
4. At lunch, someone mentioned they almost scheduled for Tuesday,
   but it didn't work out.

What is each person's favorite color and meeting day?

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Solvability | Often unsolvable or trivial | Verified solvable |
| Difficulty calibration | Inconsistent | Trained to target |
| Red herring quality | Too obvious or too misleading | Balanced |
| Clue economy | Over/under-constrained | Right amount |
| Originality | Template-y | Novel constructions |

Environment Design

class PuzzleCraftEnv(Env):
    def __init__(
        self,
        puzzle_type: str,  # "logic_grid", "cryptic", "riddle", "math"
        target_difficulty: float,  # 0.0 (trivial) to 1.0 (expert)
        solver: PuzzleSolver,  # Verifier model or algorithm
    ):
        self.puzzle_type = puzzle_type
        self.target_difficulty = target_difficulty
        self.solver = solver

    async def step(self, model_output_tokens) -> StepResult:
        puzzle_text, thinking = parse_response(model_output_tokens)
        solution = extract_solution(thinking)

        # Verify solvability
        solver_result = await self.solver.attempt(puzzle_text, timeout=300)

        reward = self.compute_reward(
            puzzle_text, thinking, solution, solver_result
        )

        return StepResult(
            reward=reward,
            episode_done=True,  # Single-turn generation
            metrics={
                "solvable": solver_result.solved,
                "solve_time": solver_result.time,
                "solution_matches": solver_result.solution == solution,
            }
        )

Reward Function

def compute_puzzle_reward(puzzle, thinking, solution, solver_result, target_difficulty) -> float:
    reward = 0.0

    # 1. SOLVABILITY (+2.0 if solvable, -3.0 if not)
    if solver_result.solved:
        reward += 2.0
    else:
        reward -= 3.0
        return reward  # Early exit, bad puzzle

    # 2. SOLUTION CORRECTNESS (+1.0)
    # Solver found the same solution as intended
    if solver_result.solution == solution:
        reward += 1.0
    else:
        reward -= 0.5  # Multiple solutions (usually bad)

    # 3. DIFFICULTY CALIBRATION (+0.5 if on target)
    # Map solve time to difficulty score
    actual_difficulty = time_to_difficulty(solver_result.time)
    difficulty_error = abs(actual_difficulty - target_difficulty)
    if difficulty_error < 0.1:
        reward += 0.5
    elif difficulty_error < 0.2:
        reward += 0.2
    else:
        reward -= 0.3  # Way off target

    # 4. CLUE ECONOMY (+0.2)
    # No redundant clues, no missing clues
    if is_minimally_constrained(puzzle, solution):
        reward += 0.2

    # 5. ORIGINALITY (+0.1 to +0.3)
    # Not too similar to training data
    novelty = compute_novelty(puzzle)
    reward += 0.3 * novelty

    # 6. RED HERRING QUALITY (+0.2)
    # Has red herrings that don't break solvability
    if has_quality_red_herrings(puzzle, thinking):
        reward += 0.2

    return reward
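
time_to_difficulty above is left abstract. A minimal sketch, assuming you map solver seconds onto the 0.0-1.0 difficulty scale with anchor points taken from human timing data:

def time_to_difficulty(solve_seconds: float, max_seconds: float = 1800.0) -> float:
    # Piecewise-linear map from solve time to difficulty; the anchors (60-second
    # floor, 30-minute cap) are illustrative, not calibrated values.
    if solve_seconds <= 60:
        return 0.1
    if solve_seconds >= max_seconds:
        return 1.0
    return 0.1 + 0.9 * (solve_seconds - 60) / (max_seconds - 60)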

The Solver Component

This is critical - you need a verifier:

class PuzzleSolver:
    """Attempts to solve puzzles to verify solvability"""

    async def attempt(self, puzzle_text: str, timeout: int) -> SolverResult:
        # Option 1: Algorithmic solver (for structured puzzles)
        if self.puzzle_type == "logic_grid":
            return self.constraint_solver(puzzle_text)

        # Option 2: LLM solver (for riddles, cryptic puzzles)
        # Use a capable model with chain-of-thought
        start = time.monotonic()  # assumes `import time` at module level
        response = await self.solver_model.generate(
            f"Solve this puzzle step by step:\n{puzzle_text}"
        )
        elapsed = time.monotonic() - start

        return SolverResult(
            solved=self.verify_solution(response),
            solution=self.extract_solution(response),
            time=elapsed,
            reasoning=response
        )
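
For "Option 1" above, small logic grids don't need an LLM at all: an exhaustive search over assignments both verifies solvability and counts solutions. A minimal sketch for the three-person Meeting Puzzle shown earlier, with the clues hard-coded:

from itertools import permutations

def solve_meeting_puzzle() -> list[dict]:
    # Enumerate every color/day assignment and keep those satisfying the clues.
    # Exactly one surviving assignment means the puzzle has a unique solution.
    people = ["Alice", "Bob", "Carol"]
    solutions = []
    for colors in permutations(["red", "blue", "green"]):
        for days in permutations(["Monday", "Wednesday", "Friday"]):
            assignment = dict(zip(people, zip(colors, days)))
            if assignment["Alice"][0] != "red":                                        # Clue 1
                continue
            if any(c == "blue" and d == "Friday" for c, d in assignment.values()):    # Clue 2
                continue
            if (assignment["Alice"][1], assignment["Carol"][1]) != ("Monday", "Friday"):  # Clue 3
                continue
            solutions.append(assignment)
    return solutions

print(solve_meeting_puzzle())  # one entry: Alice=red/Monday, Bob=blue/Wednesday, Carol=green/Friday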

Puzzle Types

Logic Grid

<think>
GRID: 3x3x3 (people × colors × days)
SOLUTION: [(Alice, Red, Monday), (Bob, Blue, Wednesday), (Carol, Green, Friday)]
CONSTRAINT_DESIGN: Give 4 clues, require 3 inference steps
</think>

Riddles

<think>
ANSWER: A mirror
PROPERTIES_TO_HINT:
  - Shows you but isn't you
  - Copies movement
  - Made of glass
  - Reverses left/right
MISDIRECTION: Hint at "twin" or "ghost" interpretations
</think>

I show you yourself, but I am not you.
I copy your every move, yet I am still.
I reverse your hand but not your heart.
What am I?

Cryptic Crossword Clues

<think>
ANSWER: PASTRY
WORDPLAY: PAST (former) + RY (railway abbrev)
DEFINITION: "baked good"
SURFACE_READING: Should read naturally about something else
</think>

Former railway produces baked good (6)

Math/Logic Problems

<think>
ANSWER: 42
SETUP: Classic "ages" problem with twist
REQUIRED_INSIGHT: Realize the "years ago" statement has two interpretations
DIFFICULTY: Hard (requires careful reading)
</think>

Training Data Sources

  1. Existing Puzzle Databases

    • Logic puzzle archives
    • Cryptic crossword collections
    • Riddle databases with solutions
  2. Reverse Engineering

    • Take solutions, have model generate clues
    • Verify with solver
    • Filter for quality
  3. Human Verification

    • Generate candidates
    • Human testers solve and rate
    • Train on high-rated puzzles

Thinking Block Schema

<think>
PUZZLE_TYPE: Logic grid
DIMENSIONS: 4 people × 4 jobs × 4 cities

SOLUTION:
  - Alice: Doctor, Paris
  - Bob: Engineer, Tokyo
  - Carol: Teacher, London
  - Dave: Artist, Sydney

DIFFICULTY_TARGET: Hard
REQUIRED_INFERENCES: 6 steps
ESTIMATED_SOLVE_TIME: 20-30 minutes

CLUE_CONSTRUCTION:
  Clue 1 (Anchor): "The doctor lives in Paris" [Direct, gives starting point]
  Clue 2 (Elimination): "Bob is not a teacher" [Reduces Bob's space]
  Clue 3 (Relational): "Carol's city comes before Dave's alphabetically"
  Clue 4 (Chain): "The person in Tokyo has a job starting with E"
  Clue 5 (Negative): "Alice is not in the same continent as Bob"

RED_HERRINGS:
  - Mention Madrid (not a valid city)
  - Clue about "someone" that applies to multiple people

VERIFICATION_PATH:
  1. Clue 1 → Doctor=Paris
  2. Clue 4 → Tokyo=Engineer
  3. Clue 5 + geography → Alice=Paris (Europe), Bob=Tokyo (Asia)
  4. Therefore Alice=Doctor, Bob=Engineer
  5. Clue 2 → Bob≠Teacher (already engineer, consistent)
  6. Clue 3 → Carol before Dave alphabetically: London < Sydney
  7. Carol=London, Dave=Sydney
  8. Remaining: Carol=Teacher, Dave=Artist ✓
</think>

Training Pipeline

Phase 1: SFT on Quality Puzzles
├── Train on existing puzzles with annotated solutions
├── Model learns puzzle structure and clue writing
└── ~20K puzzles across types

Phase 2: RL with Solver Verification
├── Generate puzzle, verify with solver
├── Reward: solvable + correct_difficulty + original
├── Key: solver must be capable but not perfect
└── ~100K generation attempts

Phase 3: Difficulty Calibration
├── Human timing data on solve times
├── Map solve time to difficulty score
├── Train to target specific difficulty bands
└── Separate model for "easy", "medium", "hard"

Evaluation Metrics

  • Solvability Rate: % of puzzles that are actually solvable
  • Unique Solution Rate: % with exactly one solution
  • Difficulty Accuracy: MAE between target and actual difficulty
  • Human Enjoyment: Rating of puzzle quality (fun, fair, satisfying)
  • Novelty Score: Distance from training puzzles

Why Thinking Models Excel Here

  1. Solution-First Design: Model can derive clues from solution, not vice versa
  2. Verification Step: Can check its own puzzle logic before outputting
  3. Difficulty Reasoning: Can explicitly count inference steps
  4. Red Herring Planning: Can design misdirection without breaking puzzle
  5. Constraint Balance: Can verify minimality (no redundant clues)

The key insight: good puzzles are designed backwards (solution → clues), and the thinking block lets the model do exactly this while hiding the solution from the solver/player.

Character Voice Actor

Fine-tune a model via DPO to embody specific character voices—their speech patterns, worldview, emotional responses, and quirks. The goal is consistent, recognizable characterization that goes beyond surface-level mimicry.

The Core Idea

Train a model to "be" a character so thoroughly that:

  • Speech patterns are consistent (vocabulary, sentence structure, verbal tics)
  • Emotional responses match the character's psychology
  • Worldview colors all observations and judgments
  • The voice is recognizable even without explicit markers
Character: Hard-boiled detective (1940s noir)

Input: "Describe a rainy evening in the city"

<think>
VOICE_CHECKLIST:
  - Short, punchy sentences
  - Cynical observations
  - Similes involving violence or decay
  - World-weary but observant
  - Never sentimental
</think>

Rain hit the pavement like a drunk hitting the floor—sudden, messy,
and nobody's problem but its own. The neon signs bled their colors
into the gutters. Red, blue, green. Like a bruise forming in
slow motion.

I lit a cigarette and watched the people scatter. They had places
to be. I didn't. That was the difference between us. That, and I
knew the rain would stop eventually. Their problems wouldn't.

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned (DPO) |
| --- | --- | --- |
| Consistency | Drifts, breaks character | Rock solid |
| Subtlety | Exaggerates obvious traits | Captures nuance |
| Instruction following | Fights the character voice | Voice persists through any task |
| Depth | Surface mimicry | Internalized worldview |
| Multi-turn | Loses voice over long convos | Maintains voice |

DPO Training Approach

Character voice is perfect for DPO because "which sounds more like X?" is a natural human judgment.

# Preference data format
comparison = {
    "prompt": "As [CHARACTER], explain how a car engine works",
    "chosen": {
        "content": "[Response that explains engines IN CHARACTER VOICE]"
    },
    "rejected": {
        "content": "[Response that explains engines but breaks voice]"
    }
}

Data Generation Pipeline

Step 1: Define Character Profile

character = CharacterProfile(
    name="Detective Jack Malone",
    era="1940s Los Angeles",
    speech_patterns=[
        "Short declarative sentences",
        "Rhetorical questions answered by himself",
        "Similes involving violence, weather, or decay",
        "Avoids contractions when serious",
        "Uses 'dame' not 'woman', 'gat' not 'gun'",
    ],
    worldview=[
        "Everyone has secrets",
        "Trust is earned, never given",
        "The city is corrupt but he loves it anyway",
        "Cynical about institutions, soft on individuals",
    ],
    emotional_responses={
        "danger": "calm, almost bored",
        "betrayal": "quiet anger, not surprise",
        "beauty": "notices but doesn't dwell",
        "injustice": "resigned determination",
    },
    verbal_tics=["ending thoughts with '...'", "self-deprecating humor"],
    taboos=["never whiny", "never verbose", "never modern slang"],
)

Step 2: Generate Comparison Pairs

async def generate_comparison(character, prompt):
    # Generate multiple responses
    responses = await model.generate_n(
        f"As {character.name}: {prompt}",
        n=4,
        temperature=0.9
    )

    # Score each response for character adherence
    scores = [
        score_character_adherence(r, character)
        for r in responses
    ]

    # Best vs worst becomes a comparison pair
    best = responses[scores.index(max(scores))]
    worst = responses[scores.index(min(scores))]

    return Comparison(prompt=prompt, chosen=best, rejected=worst)

Step 3: Human Verification

# Present pairs to annotators familiar with the character
# "Which response sounds more like [CHARACTER]?"
# Filter for high agreement pairs
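
A minimal sketch of that filtering step, assuming each pair was shown to several annotators and each vote records whether the annotator preferred the automatically chosen response:

def filter_by_agreement(pairs_with_votes, min_agreement: float = 0.8):
    # pairs_with_votes: list of (Comparison, list[bool]) where True means the
    # annotator preferred the `chosen` response. Keep confident pairs only.
    kept = []
    for pair, votes in pairs_with_votes:
        agreement = sum(votes) / max(len(votes), 1)
        if agreement >= min_agreement:
            kept.append(pair)
        elif agreement <= 1 - min_agreement:
            # Annotators strongly preferred the other response: swap and keep.
            pair.chosen, pair.rejected = pair.rejected, pair.chosen
            kept.append(pair)
    return kept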

Character Adherence Scoring

def score_character_adherence(response: str, character: CharacterProfile) -> float:
    score = 0.0

    # 1. SPEECH PATTERN MATCH
    for pattern in character.speech_patterns:
        if pattern_present(response, pattern):
            score += 0.1

    # 2. TABOO VIOLATIONS
    for taboo in character.taboos:
        if taboo_violated(response, taboo):
            score -= 0.3

    # 3. WORLDVIEW CONSISTENCY
    # Use an LLM to judge: "Does this response reflect [WORLDVIEW]?"
    worldview_score = llm_judge_worldview(response, character.worldview)
    score += worldview_score * 0.3

    # 4. VOCABULARY MATCH
    # Character-specific words present, out-of-character words absent
    vocab_score = vocabulary_score(response, character)
    score += vocab_score * 0.2

    # 5. TONE CLASSIFIER
    # Trained classifier for this character's tone
    tone_score = tone_classifier(response, character.name)
    score += tone_score * 0.2

    return score

Training Configuration

@chz.chz
class CharacterDPOConfig:
    model_name: str = "moonshotai/Kimi-K2-Instruct"
    character_profile: CharacterProfile

    # DPO hyperparameters
    dpo_beta: float = 0.1  # Lower = more aggressive preference learning
    learning_rate: float = 5e-6  # Conservative for style transfer

    # Data
    num_comparisons: int = 5000
    comparison_sources: list[str] = [
        "synthetic_generation",  # Model generates, we score
        "human_writing",  # Examples from source material
        "human_preferences",  # A/B tests with fans
    ]

Thinking Block for Character Work

<think>
CHARACTER: Detective Jack Malone
SITUATION: Explaining quantum physics to a kid

VOICE_ADAPTATION:
- Keep noir voice but simplify
- Use concrete metaphors (his style)
- Stay in character's knowledge (1940s)
- He'd approach this with gruff patience

WORD_CHOICES:
- "tiny" not "subatomic"
- "acts screwy" not "exhibits wave-particle duality"
- Relate to things he knows: cards, dice, shadows
</think>

Listen kid, the universe is a card sharp. You think you know
where the ace is, but the moment you look, it's somewhere else.
These tiny bits that make up everything? They're like that.

They're not in one place until you catch 'em looking. Until then,
they're everywhere and nowhere. Like a rumor. Like smoke.

Don't ask me why. I've seen things that don't make sense before.
This is just one more.

Prompt Diversity for Training

Train on varied prompts to ensure voice transfers to any task:

prompt_categories = [
    # Narration
    "Describe [scene] in your voice",
    "Tell me about a time when [situation]",

    # Explanation
    "Explain [concept] to someone who doesn't understand",
    "How does [thing] work?",

    # Opinion
    "What do you think about [topic]?",
    "How would you handle [dilemma]?",

    # Dialogue
    "What would you say to [person] who [situation]?",
    "How would you comfort someone who [emotion]?",

    # Out-of-domain (stress test)
    "Write a recipe for [dish]",
    "Debug this code: [code]",
    "Summarize this article: [article]",
]

Evaluation Metrics

  • Voice Consistency: Classifier accuracy on "is this [character]?" (eval sketched after this list)
  • Human Recognition: Can fans identify the character from responses?
  • Task Performance: Does the voice interfere with actual task completion?
  • Long-Context Stability: Voice maintained over 20+ turns
  • Transfer to Novel Prompts: Voice holds on unseen prompt types
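
The voice-consistency metric can be computed with a held-out classifier: generate responses to unseen prompts, mix in distractor responses from other characters, and check how often each text is attributed correctly. is_character here stands in for whatever classifier you train:

def voice_consistency(model_responses: list[str],
                      distractor_responses: list[str],
                      is_character) -> float:
    # is_character(text) -> bool is an assumed classifier trained on known
    # in-voice vs out-of-voice samples for this character.
    correct = sum(1 for r in model_responses if is_character(r))
    correct += sum(1 for d in distractor_responses if not is_character(d))
    total = len(model_responses) + len(distractor_responses)
    return correct / max(total, 1)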

Example Characters to Train

Literary

  • Sherlock Holmes (deductive, arrogant, precise)
  • Holden Caulfield (digressive, cynical teen, "phony")
  • HAL 9000 (calm, polite, sinister undertones)

Archetypal

  • Gruff mentor with hidden heart
  • Cheerful optimist (genuine, not annoying)
  • Melancholic poet
  • Sarcastic AI assistant

Genre-Specific

  • 1940s noir detective
  • Regency-era aristocrat
  • Cyberpunk hacker
  • Victorian scientist

Why DPO Over RL

For character voice, preferences are more natural than rewards:

  • "Which sounds more like Holmes?" is easy to judge
  • Reward functions for "voice quality" are hard to specify
  • DPO is more stable for style transfer
  • Human preferences capture ineffable qualities

The thinking block helps even in DPO - the model can explicitly check its voice before responding, making it easier to learn the preference signal.

Multi-Character Extension

Train a single model to switch between characters:

<think>
ACTIVE_CHARACTER: Detective Malone
VOICE_PROFILE: Loading noir patterns...
</think>

[Response in Malone's voice]
<think>
ACTIVE_CHARACTER: Professor Higgins
VOICE_PROFILE: Loading academic pomposity...
</think>

[Response in Higgins' voice]

This requires character-tagged training data but enables a single model to voice a full cast.

Improv Comedy Partner

Fine-tune a thinking model to be a skilled improv scene partner, following "Yes, And" principles, heightening games, and creating comedic moments through collaboration.

The Core Idea

The model learns improv fundamentals:

  • Yes, And: Accept all offers, add new information
  • Heightening: Escalate the game/pattern once established
  • Support: Make your partner look good
  • Specificity: Concrete details over vague statements
  • Game of the Scene: Find and play the pattern
Human: *as a waiter* "Welcome to Chez Fancy, I'll be your server
tonight. Our special is the... *squints at notepad* ...the chicken."

<think>
OFFERS MADE:
  - Fancy restaurant
  - Waiter can't read/remember the menu
  - "Chicken" is anticlimactic for fancy restaurant

GAME IDENTIFIED: Waiter who's bad at their job in a fancy setting
HEIGHTEN STRATEGY: I'll be an oblivious rich customer who doesn't
notice the incompetence

YES_AND: Accept the restaurant, the waiter, the special
ADD: My character trait + escalate the menu confusion
</think>

*putting on reading glasses to look at menu*

"Ah, the chicken. And what farm is it from? What was its name?
At these prices, I expect to know its hopes and dreams."

*looks expectantly*

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Blocking | Sometimes denies offers | Trained to always Yes-And |
| Game recognition | Misses the pattern | Finds and heightens |
| Support instinct | Often steals focus | Makes partner shine |
| Timing | Over-explains jokes | Learns comedic rhythm |
| Commitment | Hedges, breaks character | Full commitment |

Environment Design

class ImprovSceneEnv(Env):
    def __init__(
        self,
        scene_setup: str,             # Opening line or suggestion
        partner: Policy,              # Other improviser (model or human)
        audience_model: RewardModel,  # Rates comedic quality
        max_beats: int = 10,
    ):
        self.scene_setup = scene_setup
        self.partner = partner
        self.audience_model = audience_model
        self.max_beats = max_beats
        self.scene_history: list[Beat] = []
        self.established_game: str | None = None

    async def step(self, model_response_tokens) -> StepResult:
        response, thinking = parse_response(model_response_tokens)
        beat = Beat(text=response, thinking=thinking)

        # Check for improv sins
        violations = check_violations(response, self.scene_history)

        # Get partner's response
        partner_response = await self.partner.respond(response)

        # Record the beat and track the game if one emerges
        self.scene_history.append(beat)
        self.established_game = detect_game(self.scene_history)

        reward = self.compute_reward(beat, violations, partner_response)

        return StepResult(
            reward=reward,
            episode_done=(len(self.scene_history) >= self.max_beats),
            next_observation=self.build_next_prompt(partner_response)
        )

Reward Function

def compute_improv_reward(beat, violations, scene_history, partner_response, audience_model) -> float:
    reward = 0.0

    # 1. NO BLOCKING (-2.0 per block)
    # Denying partner's reality
    if "block" in violations:
        reward -= 2.0

    # 2. YES-AND PRESENT (+0.5)
    # Accepted offer AND added new information
    if yes_and_detected(beat, scene_history[-1]):
        reward += 0.5

    # 3. GAME HEIGHTENING (+0.5)
    # If a game exists, this beat escalated it
    scene_game = detect_game(scene_history)
    if scene_game and heightens_game(beat, scene_game, scene_history):
        reward += 0.5

    # 4. SUPPORT (+0.3)
    # Made partner's character/choice look good
    if supports_partner(beat, partner_response):
        reward += 0.3

    # 5. SPECIFICITY (+0.2)
    # Concrete details, names, places
    specificity = count_specifics(beat.text)
    reward += min(0.2, specificity * 0.05)

    # 6. COMEDIC TIMING (+0.1 to +0.5)
    # Appropriate length, punchline delivery
    timing_score = audience_model.rate_timing(beat, scene_history)
    reward += timing_score * 0.5

    # 7. STEAMROLLING (-0.5)
    # Taking over the scene, not leaving room
    if steamrolls(beat):
        reward -= 0.5

    # 8. CALLBACK BONUS (+0.3)
    # References earlier beat cleverly
    if has_callback(beat, scene_history):
        reward += 0.3

    return reward

Improv Principles as Training Signals

Yes-And Detection

def yes_and_detected(current_beat, previous_beat) -> bool:
    # Extract offers from previous beat
    offers = extract_offers(previous_beat)

    # Check if current beat accepts them
    accepts = all(offer_accepted(current_beat, o) for o in offers)

    # Check if current beat adds new information
    adds = has_new_information(current_beat, previous_beat)

    return accepts and adds

Game Detection

def detect_game(scene_history) -> str | None:
    """
    Games are patterns. Examples:
    - "Character who [unusual trait] in [normal situation]"
    - "Escalating [thing] to absurd degree"
    - "Two characters who [relationship dynamic]"
    """
    # Look for repeated patterns
    patterns = find_patterns(scene_history)

    # Game = pattern that's been played 2+ times
    for pattern in patterns:
        if pattern.occurrences >= 2:
            return pattern.description

    return None

Heightening Detection

def heightens_game(beat, game, scene_history) -> bool:
    """
    Heightening = same pattern, bigger stakes/more extreme

    Example game: "Waiter who's bad at their job"
    - Beat 1: Can't read menu
    - Beat 2: Brings wrong dish (heightened)
    - Beat 3: Serves food to wrong table (more heightened)
    - Beat 4: Accidentally sets tablecloth on fire (even more)
    """
    previous_intensity = get_game_intensity(scene_history, game)
    current_intensity = get_beat_intensity(beat, game)

    return current_intensity > previous_intensity

Training Data Sources

  1. Improv Transcripts

    • UCB, Second City, iO recordings (with permission)
    • Whose Line Is It Anyway transcripts
    • Comedy podcast games (Improv4Humans, etc.)
  2. Annotated Scenes

    • Expert improvisers annotate "game", "heightening", "blocks" (example beat after this list)
    • Creates labeled training data
  3. Self-Play with Filtering

    • Two models do scenes
    • Human comedians rate quality
    • Train on high-rated scenes
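
The expert annotations mentioned above could be attached beat by beat. A hypothetical labeled beat (field names are illustrative):

annotated_beat = {
    "speaker": "customer",
    "text": "Ah, the chicken. And what farm is it from? What was its name?",
    "labels": {
        "yes_and": True,    # accepts the restaurant, the waiter, and the special
        "block": False,
        "game": "incompetent service in a fancy restaurant",
        "heightens": True,  # escalates the established pattern
    },
}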

Thinking Block Schema

<think>
SCENE_STATE:
  Setting: Doctor's office
  My character: Nervous patient
  Partner's character: Overly casual doctor
  Established game: Doctor treats serious things casually

LAST_OFFER:
  Partner said: "Yeah, your leg's broken, no biggie. Want a lollipop?"
  Implicit offers: Broken leg, casual attitude, lollipop

MY_MOVE:
  YES: Accept broken leg, accept casual tone
  AND: Heighten by adopting their casual attitude about MY OWN injury

HEIGHTENING:
  Pattern: Casual about serious medical issue
  My escalation: I'm now casual about it too, maybe too casual

RESPONSE_PLAN:
  "Oh cool, which bone? I've been meaning to break one, cross it
  off the bucket list. Is there paperwork or do I just...go?"
</think>

Training Pipeline

Phase 1: SFT on Quality Improv
├── Train on transcripts from professional improvisers
├── Model learns format, rhythm, Yes-And basics
└── ~5K scenes

Phase 2: RL for Improv Skills
├── Self-play (two models doing scenes)
├── Reward: yes_and + heightening + support - blocking
├── GRPO: same opening, different scene directions
└── ~50K scenes

Phase 3: Human Comedy Rating (DPO)
├── Comedians rate which response is funnier
├── Polish timing and punch
└── ~10K comparisons

Scene Types to Train On

Two-Person Scenes

  • Classic back-and-forth
  • Object work (miming)
  • Emotional scenes (serious improv)

Short-Form Games

  • Scenes from a Hat
  • Party Quirks
  • Forward/Reverse
  • Genre switches

Long-Form Structures

  • Harold (thematic connections)
  • Armando (monologue-inspired)
  • La Ronde (character chains)

Evaluation Metrics

  • Block Rate: % of beats that deny offers (computed in the sketch after this list)
  • Yes-And Rate: % that clearly accept and add
  • Game Adherence: Once game is found, % of beats that play it
  • Heightening Success: Does intensity increase appropriately?
  • Human Funny Rating: Actual comedy quality (subjective)
  • Partner Satisfaction: Would improvisers want to play with this model?
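
Block rate and Yes-And rate fall straight out of per-beat detectors; a minimal sketch that reuses yes_and_detected from above and assumes a comparable is_block detector:

def improv_rates(scenes, is_block, yes_and_detected) -> dict[str, float]:
    # scenes: list of scenes, each a list of beats in order.
    blocks, yes_ands, total = 0, 0, 0
    for beats in scenes:
        for prev, curr in zip(beats, beats[1:]):
            total += 1
            if is_block(curr, prev):
                blocks += 1
            if yes_and_detected(curr, prev):
                yes_ands += 1
    return {
        "block_rate": blocks / max(total, 1),
        "yes_and_rate": yes_ands / max(total, 1),
    }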

Why Thinking Models Excel Here

  1. Offer Tracking: Can explicitly list what offers have been made
  2. Game Identification: Can reason about "what's the pattern here?"
  3. Heightening Planning: Can plan escalation before executing
  4. Support Thinking: Can ask "what makes my partner look good?"
  5. Avoiding Blocks: Can check "does this deny anything?" before responding

Improv is about making choices in the moment based on what's been established. The thinking block lets the model do the analysis that skilled improvisers do instinctively.

Sample Scene

Setup: "Coffee shop"

Interactive Fiction Engine

Fine-tune a thinking model to run parser-based interactive fiction with proper world state management—tracking inventory, room connections, object states, and puzzle progress in the <think> block.

The Core Idea

Classic text adventures (Zork, Infocom games) require rigorous world state:

  • Player location in a room graph
  • Inventory with item properties
  • Object states (door locked/unlocked, lamp lit/unlit)
  • Puzzle dependencies (need key before door, need lamp before dark room)

The <think> block becomes the game engine's state machine.

Player: "take the brass key"

<think>
LOCATION: Study
INVENTORY: [lantern (lit), old map]
ROOM_CONTENTS: [brass key, mahogany desk, locked drawer]
OBJECT_STATES:
  - brass key: on_desk, takeable
  - mahogany desk: fixed, has_drawer
  - locked drawer: locked, requires=brass_key, contains=[journal]

ACTION_PARSE:
  verb: TAKE
  object: brass_key

VALIDATION:
  - brass_key in room? YES
  - brass_key takeable? YES
  - inventory full? NO

STATE_CHANGES:
  - Remove brass_key from Study.contents
  - Add brass_key to inventory

PUZZLE_PROGRESS:
  - drawer puzzle: Can now unlock (has key)
</think>

Taken.

> INVENTORY: a lantern (providing light), an old map, a brass key

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| State consistency | Forgets items, contradicts | Rigorous tracking |
| Parser accuracy | Misinterprets commands | Trained on IF vocabulary |
| Puzzle logic | Breaks dependencies | Maintains requirements |
| Room geography | Invents new exits | Consistent map |
| Object permanence | Items appear/disappear | Proper tracking |

Environment Design

class InteractiveFictionEnv(Env):
    def __init__(
        self,
        game_world: GameWorld,  # Rooms, objects, puzzles
        starting_room: str,
        parser: CommandParser
    ):
        self.world = game_world
        self.parser = parser
        self.player = PlayerState(
            location=starting_room,
            inventory=[],
            flags={}
        )
        self.turn_count = 0

    async def step(self, model_response_tokens) -> StepResult:
        response, thinking = parse_response(model_response_tokens)

        # Extract state changes from thinking
        state_changes = extract_state_changes(thinking)

        # Validate state changes against game rules
        validation = self.validate_changes(state_changes)

        # Apply valid changes
        if validation.valid:
            self.apply_changes(state_changes)

        reward = self.compute_reward(response, thinking, validation)
        self.turn_count += 1

        # Get next player command
        player_command = await self.get_player_command()

        return StepResult(
            reward=reward,
            episode_done=self.game_complete(),
            next_observation=self.build_prompt(player_command)
        )

Reward Function

def compute_if_reward(response, thinking, validation, game_state) -> float:
    reward = 0.0

    # 1. STATE CONSISTENCY (+0.3)
    # Thinking matches actual game state
    thinking_state = extract_state_from_thinking(thinking)
    if thinking_state == game_state:
        reward += 0.3
    else:
        reward -= 0.5  # Hallucinated state

    # 2. VALID STATE CHANGE (+0.2)
    # Proposed changes follow game rules
    if validation.valid:
        reward += 0.2
    else:
        reward -= 0.3

    # 3. APPROPRIATE RESPONSE (+0.2)
    # Response matches the game's tone and format
    if matches_if_style(response):
        reward += 0.2

    # 4. PUZZLE PROGRESS (+0.5)
    # Player advanced toward a puzzle solution
    state_changes = extract_state_changes(thinking)
    if advances_puzzle(state_changes, game_state.puzzles):
        reward += 0.5

    # 5. NO IMPOSSIBLE ACTIONS (-0.5)
    # Didn't allow actions that should fail
    if allowed_impossible_action(response, thinking, game_state):
        reward -= 0.5

    # 6. PROPER FAILURE MESSAGES (+0.1)
    # When the action fails, the explanation is correct and helpful
    if not validation.valid and has_good_failure_message(response):
        reward += 0.1

    return reward

World State Schema

from __future__ import annotations

from dataclasses import dataclass
from typing import Any

@dataclass
class GameWorld:
    rooms: dict[str, Room]
    objects: dict[str, Object]
    puzzles: list[Puzzle]
    global_flags: dict[str, bool]

@dataclass
class Room:
    name: str
    description: str
    exits: dict[str, str]  # direction -> room_id
    contents: list[str]    # object_ids
    first_visit_text: str | None
    dark: bool = False     # Requires light source

@dataclass
class Object:
    id: str
    name: str
    description: str
    portable: bool
    states: dict[str, Any]  # e.g., {"locked": True, "lit": False}
    contains: list[str]     # For containers
    requirements: dict[str, str]  # action -> required_item/flag

@dataclass
class Puzzle:
    id: str
    description: str
    required_steps: list[str]  # Ordered actions needed
    completed_steps: list[str]
    reward_text: str

Thinking Block Schema

<think>
=== WORLD STATE ===
TURN: 47
LOCATION: Wine Cellar
VISITED_ROOMS: [Foyer, Study, Kitchen, Wine Cellar]

INVENTORY:
  - brass lantern (lit, fuel=80%)
  - iron key (unlocks=study_door)
  - crumpled note (read=true, clue="behind the third barrel")

ROOM_STATE:
  Description: "A dusty cellar with rows of wine barrels"
  Exits: {up: Kitchen}
  Contents: [wine_barrel_1, wine_barrel_2, wine_barrel_3, cobwebs]
  Dark: true (lantern providing light)

OBJECT_STATES:
  wine_barrel_3:
    examined: false
    moveable: true
    behind: [hidden_passage]  # UNREVEALED

ACTIVE_PUZZLES:
  find_secret_passage:
    hint_found: true (crumpled note)
    steps_needed: [examine_barrel_3, move_barrel_3, enter_passage]
    steps_done: []

=== PLAYER COMMAND ===
"examine third barrel"

=== PROCESSING ===
ACTION: EXAMINE
TARGET: wine_barrel_3

VALIDATION:
  - In room? YES (wine_barrel_3 in Wine Cellar)
  - Visible? YES (have light)
  - Examinable? YES

RESPONSE_PLAN:
  - Describe barrel
  - This reveals it might be moveable
  - Update wine_barrel_3.examined = true

=== STATE CHANGES ===
wine_barrel_3.examined = true
</think>

The third barrel from the left is slightly askew, as if it's been moved
before. The dust on the floor shows faint drag marks leading toward the
wall. Unlike its neighbors, this barrel rings hollow when you tap it.

Training Data Sources

  1. Existing IF Games

    • Infocom games (transcripts)
    • Modern IF from IFComp
    • Inform 7 / TADS game sources (include world definitions)
  2. Annotated Playthroughs

    • Full playthroughs with state annotations
    • Every command → state change → response
  3. Synthetic World Generation

    • Procedurally generate small game worlds
    • Play through with RL agent
    • Filter for coherent transcripts
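
For the synthetic-generation source, even a tiny procedural generator produces worlds with the right shape: a connected room chain, a portable key, and one gated puzzle. A minimal sketch using the Room/Object/Puzzle schema defined above:

import random

def generate_small_world(num_rooms: int = 4) -> GameWorld:
    # Build a west-east chain of rooms, hide a key somewhere before the end,
    # and add a single "unlock the far door" puzzle that needs it.
    rooms, objects = {}, {}
    names = [f"room_{i}" for i in range(num_rooms)]
    for i, name in enumerate(names):
        exits = {}
        if i > 0:
            exits["west"] = names[i - 1]
        if i < num_rooms - 1:
            exits["east"] = names[i + 1]
        rooms[name] = Room(name=name, description=f"A bare stone chamber ({name}).",
                           exits=exits, contents=[], first_visit_text=None)
    key_room = random.choice(names[:-1])
    objects["iron_key"] = Object(id="iron_key", name="iron key",
                                 description="A heavy iron key.", portable=True,
                                 states={}, contains=[], requirements={})
    rooms[key_room].contents.append("iron_key")
    puzzle = Puzzle(id="open_final_door", description="Unlock the far door.",
                    required_steps=["take iron_key", "unlock door"],
                    completed_steps=[], reward_text="The door swings open.")
    return GameWorld(rooms=rooms, objects=objects, puzzles=[puzzle], global_flags={})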

Command Types to Handle

VERBS = {
    # Movement
    "go", "walk", "enter", "exit", "climb",

    # Object interaction
    "take", "get", "drop", "put", "give",
    "open", "close", "lock", "unlock",
    "examine", "look", "read", "search",

    # Object manipulation
    "push", "pull", "turn", "move", "lift",
    "light", "extinguish", "eat", "drink",

    # Communication
    "talk", "ask", "tell", "show",

    # Meta
    "inventory", "score", "save", "restore", "quit"
}

Puzzle Dependency System

class PuzzleDependency:
    """Ensures puzzles are solved in valid order"""

    dependencies = {
        "enter_dark_room": ["have_light_source"],
        "unlock_chest": ["have_chest_key"],
        "read_ancient_text": ["have_translation_guide", "have_ancient_text"],
        "cross_chasm": ["have_rope", "anchor_point_found"],
    }

    def can_perform(self, action: str, player_state: PlayerState) -> bool:
        if action not in self.dependencies:
            return True
        return all(
            self.check_requirement(req, player_state)
            for req in self.dependencies[action]
        )

    def check_requirement(self, req: str, player_state: PlayerState) -> bool:
        # Sketch: a requirement is either an inventory item id or a boolean flag
        return req in player_state.inventory or player_state.flags.get(req, False)

Training Pipeline

Phase 1: SFT on IF Transcripts
├── Train on annotated game transcripts
├── Model learns command parsing, state format, IF prose style
└── ~10K game sessions

Phase 2: RL with State Verification
├── Play generated games
├── Reward: state_consistency + valid_changes + puzzle_progress
├── Automatic verification against true game state
└── ~100K play sessions

Phase 3: World Generalization
├── Train on diverse world types (fantasy, sci-fi, mystery)
├── Test on held-out game structures
└── Ensure transfer to novel worlds

Evaluation Metrics

  • State Accuracy: % of turns where thinking matches true state
  • Parse Accuracy: % of commands correctly interpreted
  • Consistency Rate: % of games with no state contradictions
  • Solvability: Can the game be completed as designed?
  • Atmosphere Rating: Human rating of prose quality

Example Game Snippet

=== THE ABANDONED LIGHTHOUSE ===

You stand at the base of an old lighthouse, its paint peeling and windows
dark. A cold wind blows off the sea.

> GO NORTH

You can't go that way. The cliff drops sharply to churning waves below.

> EXAMINE LIGHTHOUSE

The lighthouse door hangs open, darkness within. Above, you can see the
lamp room at the top, perhaps a hundred feet up. A faded sign reads
"KEEPER'S ENTRANCE."

> ENTER LIGHTHOUSE

LIGHTHOUSE GROUND FLOOR
Stone walls seep with moisture. A spiral staircase winds upward into
darkness. A wooden desk sits against one wall, its drawers hanging open.
An old lantern rests on the desk, unlit.

> TAKE LANTERN

Taken.

> INVENTORY
You are carrying:
  - an old lantern (unlit)

> GO UP

It's too dark to see the stairs. You might fall.

> LIGHT LANTERN

You don't have anything to light it with.

Why Thinking Models Excel Here

  1. State Machine: <think> block is literally the game state
  2. Validation Logic: Can check "is this action legal?" before responding
  3. Puzzle Tracking: Explicit progress tracking in thinking
  4. Consistency: State written out = state remembered
  5. Complex Dependencies: Can reason about multi-step requirements

Interactive fiction requires machine-like state tracking with human-like prose. The thinking block handles the machine part while keeping it separate from the narrative output.

Murder Mystery Host

Fine-tune a thinking model to run murder mystery games where it knows the killer's identity (in <think>) and must give fair clues without spoiling, creating a solvable-but-challenging experience.

The Core Idea

The model plays a murder mystery host/DM:

  • Knows who the killer is from the start
  • Must provide fair clues when players investigate
  • Plants red herrings without making it unsolvable
  • Tracks what players have discovered
  • Judges accusations fairly
Player: "I want to search Lord Ashworth's study"

<think>
=== SECRET STATE ===
KILLER: Lord Ashworth
MOTIVE: Inheritance - victim was changing will
MURDER_WEAPON: Letter opener (in study, not yet found)
KEY_CLUES:
  - Bloody cufflink in garden (found)
  - Torn letter mentioning inheritance (NOT found, in study)
  - Witness: Butler saw Ashworth near study at 11pm (NOT revealed)

=== PLAYER STATE ===
CLUES_FOUND: [bloody cufflink]
SUSPECTS_CLEARED: []
CURRENT_SUSPICION: Maid (wrong)

=== SCENE: Lord Ashworth's Study ===
CONTAINS:
  - Letter opener (murder weapon) - hidden in desk drawer
  - Torn letter (key clue) - in wastebasket
  - Financial documents (red herring) - on desk
  - Brandy glass (no relevance) - on side table

=== DECISION ===
Player is searching killer's room. This is good investigation.
- MUST: Let them find the torn letter (fair play)
- SHOULD: Describe letter opener but don't flag it obviously
- MAY: Include red herring about finances
- BALANCE: This brings them closer, need to maintain tension
</think>

Lord Ashworth's study smells of tobacco and old books. A mahogany desk
dominates the room, papers scattered across its surface—financial
documents, by the look of them. Debts, perhaps? A half-empty brandy
glass sits forgotten on a side table.

As you look around, something in the wastebasket catches your eye:
torn scraps of paper. You piece them together enough to read:
"...hereby revoke all previous provisions regarding my nephew...
new beneficiary shall be..."

The rest is missing. But someone clearly didn't want this letter
to be read.

*What would you like to do?*

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Secret keeping | Sometimes leaks killer | Trained not to leak |
| Clue fairness | Over/under-clues | Calibrated difficulty |
| Red herrings | Too obvious or misleading | Balanced misdirection |
| Pacing | Dumps all info or withholds too much | Dramatic reveal timing |
| Accusation handling | Inconsistent | Fair adjudication |

Environment Design

class MurderMysteryEnv(Env):
    def __init__(
        self,
        mystery: MysteryScenario,
        player: Policy,  # Or human
        clue_tracker: ClueTracker
    ):
        self.mystery = mystery
        self.player = player
        self.clue_tracker = clue_tracker
        self.clues_revealed: set[str] = set()
        self.red_herrings_revealed: set[str] = set()
        self.turns_remaining = 20
        self.player_accusations: list[Accusation] = []

    async def step(self, host_response_tokens) -> StepResult:
        response, thinking = parse_response(host_response_tokens)

        # Check for leaks (killer identity in response)
        leaked = check_for_leaks(response, self.mystery)

        # Track revealed clues
        new_clues = extract_revealed_clues(response, self.mystery.clues)
        self.clues_revealed.update(new_clues)

        # Get player's next action
        player_action = await self.player.act(response)

        # Check if player is making an accusation
        if is_accusation(player_action):
            return self.handle_accusation(player_action)

        self.turns_remaining -= 1
        reward = self.compute_reward(response, thinking, leaked, new_clues)

        return StepResult(
            reward=reward,
            episode_done=(self.turns_remaining <= 0),
            next_observation=self.build_prompt(player_action)
        )

Reward Function

def compute_mystery_reward(
    response, thinking, leaked, new_clues, game_state
) -> float:
    reward = 0.0

    # 1. NO LEAKAGE (-5.0)
    if leaked:
        reward -= 5.0
        return reward  # Fatal error

    # 2. FAIR CLUE DELIVERY (+0.3 per appropriate clue)
    # Clue given when player investigated correctly
    for clue in new_clues:
        if clue_was_earned(clue, game_state.last_player_action):
            reward += 0.3
        else:
            reward -= 0.2  # Gratuitous clue

    # 3. RED HERRING BALANCE (+0.2)
    # Some misdirection, but not overwhelming
    herring_ratio = len(game_state.herrings_revealed) / max(len(game_state.clues_revealed), 1)
    if 0.3 < herring_ratio < 0.7:
        reward += 0.2

    # 4. APPROPRIATE WITHHOLDING (+0.2)
    # Didn't reveal clues player didn't earn
    if no_unearned_clues(response, game_state):
        reward += 0.2

    # 5. ATMOSPHERE (+0.1)
    # Response is dramatic, atmospheric
    reward += atmosphere_score(response) * 0.1

    # 6. SOLVABILITY MAINTENANCE (+0.3)
    # Game remains solvable after this turn
    if game_still_solvable(game_state):
        reward += 0.3

    # FINAL GAME SCORING
    if game_state.game_over:
        if game_state.player_won:
            # Player solved it
            if len(game_state.clues_revealed) >= 2:
                reward += 2.0  # Fair win
        else:
            # Player lost
            if len(game_state.clues_revealed) >= 3:
                reward += 1.0  # Fair loss (enough clues given)
            else:
                reward -= 1.0  # Unfair (withheld too much)

    return reward

Mystery Generation

@dataclass
class MysteryScenario:
    setting: str  # "1920s manor", "space station", "small town"
    victim: Character
    killer: Character
    suspects: list[Character]
    motive: str
    opportunity: str
    murder_weapon: Item
    clues: list[Clue]  # Point to killer
    red_herrings: list[Clue]  # Point to innocents

@dataclass
class Clue:
    id: str
    description: str
    location: str
    points_to: str  # Character or red herring
    obviousness: float  # 0.0 (subtle) to 1.0 (blatant)
    requires: list[str]  # Prerequisites to find

@dataclass
class Character:
    name: str
    role: str
    motive: str | None  # Reason they might have done it
    alibi: str
    alibi_solid: bool  # Can alibi be broken?
    secrets: list[str]  # Non-murder secrets (red herrings)

Procedural Mystery Generator

def generate_mystery(setting: str, num_suspects: int = 5) -> MysteryScenario:
    # Create cast
    victim = generate_victim(setting)
    suspects = [generate_suspect(setting) for _ in range(num_suspects)]
    killer = random.choice(suspects)

    # Give killer weak alibi
    killer.alibi_solid = False
    killer.motive = generate_strong_motive(victim, killer)

    # Give innocents solid alibis (eventually provable)
    for s in suspects:
        if s != killer:
            s.alibi_solid = True

    # Generate clues
    clues = [
        Clue(points_to=killer, obviousness=random.uniform(0.3, 0.7))
        for _ in range(3)
    ]

    # Generate red herrings
    red_herrings = [
        Clue(points_to=random.choice([s for s in suspects if s != killer]))
        for _ in range(4)
    ]

    return MysteryScenario(
        setting=setting,
        victim=victim,
        killer=killer,
        suspects=suspects,
        clues=clues,
        red_herrings=red_herrings,
        ...
    )

Thinking Block Schema

<think>
=== MYSTERY STATE ===
SETTING: Windmere Manor, 1923
VICTIM: Sir Reginald Price (found in library, blunt force trauma)
KILLER: Lady Catherine Price (wife)
MOTIVE: Discovered affair, written out of will

=== SUSPECTS ===
1. Lady Catherine (KILLER)
   - Motive: Being divorced, losing everything
   - Alibi: "In her room" - NO WITNESS
   - Key evidence: Monogrammed handkerchief near body

2. James (butler)
   - Motive: Owed wages
   - Alibi: With cook in kitchen - SOLID
   - Red herring: Overheard arguing with victim

3. Thomas Price (nephew)
   - Motive: Inheritance
   - Alibi: At pub in village - SOLID (witnesses)
   - Red herring: Recently asked for money

4. Dr. Mills (family physician)
   - Motive: None apparent
   - Alibi: Arrived after death
   - Red herring: Knows about victim's heart condition

=== CLUE TRACKING ===
TOTAL_CLUES: 3 (need 2+ for fair solve)
REVEALED: 1 (monogrammed handkerchief - points to Catherine)
UNREVEALED:
  - Torn will fragment (in study fireplace)
  - Catherine's shoe with garden mud (she claims she didn't go out)

RED_HERRINGS_REVEALED: 2 of 4
- Butler argument (revealed)
- Thomas money troubles (revealed)

=== PLAYER STATE ===
Current focus: Suspects the butler (wrong)
Clues found: handkerchief (but attributed to maid handling)
Turns remaining: 12

=== FAIRNESS CHECK ===
Solvable? YES - handkerchief + shoe would clinch it
Player on right track? NO - need to redirect without being obvious
</think>

Training Data Sources

  1. Murder Mystery Party Scripts

    • Commercial mystery party games
    • Include host notes (what to reveal when)
  2. Detective Fiction

    • Golden Age mysteries (Christie, Sayers)
    • Annotate clue placement and fair play
  3. Interactive Mystery Games

    • Danganronpa, Ace Attorney transcripts
    • Return of the Obra Dinn
  4. Self-Play

    • Generate mysteries
    • RL agent plays investigator
    • Filter for fair, solvable games

Training Pipeline

Phase 1: SFT on Mystery Scripts
├── Train on annotated mystery hosting
├── Model learns clue delivery, atmosphere, pacing
└── ~5K mystery sessions

Phase 2: RL with Fairness Rewards
├── Self-play with investigator agent
├── Reward: no_leak + fair_clues + solvable + atmosphere
├── GRPO: same mystery, different investigation paths
└── ~50K sessions

Phase 3: Human Playtesting
├── Real players rate mystery quality
├── "Was this fair?" "Was it fun?" "Too easy/hard?"
├── DPO on preferred host responses
└── Calibrate difficulty
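
Phase 2 relies on GRPO-style group advantages: several hosting rollouts of the same mystery are scored, and each rollout's advantage is its reward relative to the rest of the group. A minimal sketch of that grouping step, assuming a run_episode helper that plays one full investigation and returns a trajectory with a total_reward field (not part of any specific library):

import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: reward minus group mean, scaled by group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # Avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]

async def collect_group(mystery, policy, group_size: int = 8):
    # run_episode is assumed: plays one hosting attempt of `mystery` with `policy`
    episodes = [await run_episode(mystery, policy) for _ in range(group_size)]
    rewards = [ep.total_reward for ep in episodes]
    advantages = grpo_advantages(rewards)
    # Each (episode, advantage) pair becomes one policy-gradient training example
    return list(zip(episodes, advantages))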

Evaluation Metrics

  • Leak Rate: % of games where killer revealed prematurely
  • Solvability Rate: % of games where player could have won with given clues
  • Fairness Rating: Human rating of "was this fair?"
  • Solve Rate by Clues: Correlation between clues given and solve rate
  • Atmosphere Rating: Human rating of prose/tension quality
  • Replay Value: Do players want to try different mysteries?
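
As a sketch of how these metrics roll up from per-game logs, assuming the environment records a small summary per session (the GameLog fields below are illustrative, not an existing API):

from dataclasses import dataclass

@dataclass
class GameLog:
    leaked: bool            # Killer identity appeared in visible text
    solvable: bool          # Enough clues were revealed for a fair solve
    player_won: bool
    clues_revealed: int
    fairness_rating: float  # Human rating, 0.0 to 1.0

def summarize(logs: list[GameLog]) -> dict[str, float]:
    n = len(logs)
    return {
        "leak_rate": sum(g.leaked for g in logs) / n,
        "solvability_rate": sum(g.solvable for g in logs) / n,
        "solve_rate": sum(g.player_won for g in logs) / n,
        "avg_fairness": sum(g.fairness_rating for g in logs) / n,
    }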

Accusation Handling

def handle_accusation(accusation: Accusation, game_state: GameState) -> StepResult:
    correct = accusation.accused == game_state.mystery.killer

    if correct:
        # Generate victory narration
        response = generate_reveal(game_state, player_won=True)
        reward = 2.0 if len(game_state.clues_revealed) >= 2 else 0.5
    else:
        # Wrong accusation
        response = generate_wrong_accusation_response(accusation, game_state)
        reward = -0.5  # Allowed wrong guess (might still be fair)

    return StepResult(
        reward=reward,
        episode_done=correct,  # Game ends on correct accusation
        next_observation=response
    )

Why Thinking Models Excel Here

  1. Secret State: <think> holds killer identity, never leaking
  2. Fairness Reasoning: Can check "have I given enough clues?"
  3. Red Herring Balance: Can track misdirection vs real clues
  4. Dynamic Difficulty: Can adjust based on player progress
  5. Dramatic Timing: Can plan reveals for maximum impact

The murder mystery is the canonical "hidden information" game—the host knows something the player doesn't. The <think> block makes this information architecture explicit and trainable.

Poker Bluffer

Fine-tune a thinking model to play poker with hidden hand information, learning to bluff, read opponent patterns, and manage pot odds, all without ever leaking its hand strength.

The Core Idea

The model plays poker with its hand in the <think> block:

  • Knows its own cards (hidden from opponent)
  • Tracks opponent betting patterns
  • Decides to bluff, value bet, or fold
  • Never leaks hand strength in visible actions/chat
Situation: Heads-up Texas Hold'em, flop is A-K-7 rainbow

<think>
=== MY HAND ===
Hole cards: 7♠ 2♦ (bottom pair, weak)

=== BOARD STATE ===
Flop: A♣ K♥ 7♦
Pot: $120
My stack: $450
Opponent stack: $380

=== OPPONENT ANALYSIS ===
Position: Big blind (defended)
Preflop action: Called my raise (medium-strong range)
Flop action: Checked to me

Pattern history:
- Check-raises with strong hands (3/5 times)
- Check-folds weak hands when facing aggression
- Rarely bluffs

=== SITUATION ASSESSMENT ===
My hand: Weak (bottom pair, easily dominated)
Board texture: Dry, favors preflop raiser (me)
Opponent likely has: Missed hand or medium pair

=== DECISION ===
Despite weak hand, I should:
1. Continuation bet (I raised preflop, board favors me)
2. Bet small (~40% pot) to deny equity and look strong
3. If raised, fold (my hand can't improve much)

This is mostly a bluff with a sliver of showdown value: my equity comes primarily from making better hands fold.
</think>

I bet $50.

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Hand leakage | Sometimes hints at strength | Trained to hide |
| Opponent modeling | Generic strategy | Learns from history |
| Bluff frequency | Too often or never | GTO-adjacent |
| Bet sizing | Arbitrary | Pot-aware |
| Long-term strategy | Can't learn opponent | Adapts over sessions |

Environment Design

class PokerEnv(Env):
    def __init__(
        self,
        game_type: str,  # "nlhe", "plo", "limit"
        opponent: Policy,
        starting_stack: int = 1000
    ):
        self.opponent = opponent
        self.deck = Deck()
        self.pot = 0
        self.board: list[Card] = []
        self.hero_hand: list[Card] = []
        self.hero_stack = starting_stack
        self.villain_stack = starting_stack
        self.hand_history: list[Action] = []

    async def step(self, model_response_tokens) -> StepResult:
        response, thinking = parse_response(model_response_tokens)

        # Parse action from response
        action = parse_poker_action(response)

        # Validate action
        if not self.is_valid_action(action):
            return StepResult(reward=-1.0, ...)  # Illegal action penalty

        # Check for hand leakage
        if leaks_hand_info(response, self.hero_hand):
            return StepResult(reward=-2.0, ...)  # Leak penalty

        # Execute action and get opponent response
        self.execute_action(action)
        if not self.hand_complete:
            villain_action = await self.opponent.act(self.visible_state)
            self.execute_action(villain_action)

        reward = self.compute_reward(action, thinking)

        return StepResult(
            reward=reward,
            episode_done=self.hand_complete,
            next_observation=self.build_prompt()
        )

Reward Function

def compute_poker_reward(
    action, thinking, hand_result, opponent_history
) -> float:
    reward = 0.0

    # 1. NO LEAKAGE (Critical)
    if leaked_hand_strength:
        reward -= 2.0
        return reward

    # 2. HAND OUTCOME (Sparse, at showdown)
    if hand_complete:
        reward += normalize_profit(chips_won - chips_invested)

    # 3. DECISION QUALITY (Dense, per action)
    # Compare action to GTO or expert baseline
    expected_value = compute_ev(action, game_state)
    optimal_ev = compute_optimal_ev(game_state)
    ev_difference = expected_value - optimal_ev
    reward += ev_difference * 0.1  # Scaled

    # 4. OPPONENT MODELING BONUS
    # Did thinking correctly identify opponent tendencies?
    if thinking_opponent_model_accurate(thinking, opponent_history):
        reward += 0.1

    # 5. BET SIZING QUALITY
    if action.type == "bet" or action.type == "raise":
        sizing_score = evaluate_sizing(action.amount, pot, stack, situation)
        reward += sizing_score * 0.05

    # 6. SUCCESSFUL BLUFF BONUS
    if won_without_showdown and hand_was_weak:
        reward += 0.2  # Successful bluff

    return reward
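
The compute_ev call above is left abstract. As a rough sketch, the EV of a call follows directly from pot odds, assuming we already have an equity estimate against the opponent's range (in practice that estimate would come from a solver or Monte Carlo rollouts):

def call_ev(pot: float, to_call: float, equity: float) -> float:
    """EV of calling: `pot` is the money already in the middle (including the
    opponent's bet); we win it with probability `equity`, lose `to_call` otherwise."""
    return equity * pot - (1 - equity) * to_call

def breakeven_equity(pot: float, to_call: float) -> float:
    """Pot odds: minimum equity for a call to be at least breakeven."""
    return to_call / (pot + to_call)

# Example: a $120 pot, opponent bets $50 -> pot = $170, to_call = $50
# breakeven_equity(170, 50) ~= 0.227, so we need roughly 23% equity to call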

Hand Strength Hiding

The critical skill: never reveal your hand through behavior.

def leaks_hand_info(response: str, hand: list[Card]) -> bool:
    """Check if response reveals hand strength"""

    # Explicit leaks
    if any(card_name(c) in response for c in hand):
        return True

    # Behavioral leaks (trained away via RL)
    # - "I'll call your bluff" (reveals medium hand)
    # - "Let me think..." for too long on easy decisions
    # - Bet sizing that perfectly correlates with hand strength

    return False

Opponent Modeling

<think>
=== OPPONENT PROFILE (Updated) ===
Hands played: 47
VPIP: 32% (loose)
PFR: 18% (passive preflop)
Aggression Factor: 1.8 (slightly aggressive postflop)

Patterns detected:
- Slowplays sets (2/2 times)
- Overbets river with strong hands (4/5 times)
- Folds to 3-bets 70% of time
- C-bet frequency: 65% (slightly low)

Exploits available:
- 3-bet light (high fold rate)
- Check-raise flops (low c-bet frequency)
- Be wary of river overbets (usually strong)

Recent history:
- Last hand: Check-called flop, check-raised turn (had two pair)
- Two hands ago: Folded to 3-bet preflop
- Three hands ago: Value bet thin on river (had top pair)
</think>
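
The stats in that profile (VPIP, PFR, aggression factor) are easy to maintain incrementally. A hedged sketch of a tracker the environment might feed into the prompt (the field and method names are illustrative, not from any poker library):

from dataclasses import dataclass

@dataclass
class OpponentStats:
    hands: int = 0
    vpip_hands: int = 0      # Hands where opponent voluntarily put money in preflop
    pfr_hands: int = 0       # Hands where opponent raised preflop
    bets_raises: int = 0     # Postflop aggressive actions
    calls: int = 0           # Postflop passive actions

    def record_hand(self, vpip: bool, pfr: bool, bets_raises: int, calls: int):
        self.hands += 1
        self.vpip_hands += vpip
        self.pfr_hands += pfr
        self.bets_raises += bets_raises
        self.calls += calls

    @property
    def vpip(self) -> float:
        return self.vpip_hands / max(self.hands, 1)

    @property
    def pfr(self) -> float:
        return self.pfr_hands / max(self.hands, 1)

    @property
    def aggression_factor(self) -> float:
        return self.bets_raises / max(self.calls, 1)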

Thinking Block Schema

<think>
=== HAND STATE ===
Street: Turn
Board: A♣ K♥ 7♦ 3♠
My hand: A♠ Q♣ (top pair, good kicker)
Hand strength: Strong (likely best)

=== POT GEOMETRY ===
Pot: $180
My stack: $410
Opponent stack: $320
Effective stack: $320 (opponent)
Stack-to-pot ratio: 1.8 (short, pot committed soon)

=== ACTION HISTORY THIS HAND ===
Preflop: I raised $15, opponent called
Flop: I bet $25, opponent called
Turn: First to act

=== OPPONENT RANGE ESTIMATE ===
Given action, opponent likely has:
- Ax hands (AJ, AT, A9) - 40%
- Kx hands (KQ, KJ, KT) - 25%
- Pocket pairs (77-QQ) - 20%
- Draws (missed on this board) - 10%
- Air (floating) - 5%

=== DECISION TREE ===
Options:
1. Bet large ($130): Folds out Kx, gets value from Ax
2. Bet medium ($90): More calls from Kx, but less value from Ax
3. Check: Pot control, induce bluffs from missed draws

Best play: Bet $100-120
- Good SPR for getting stacks in by river
- Folds out hands we beat (Kx) but that's fine
- Gets called by worse (Ax with worse kicker)

=== ACTION ===
Bet $110
</think>

I bet $110.

Training Data Sources

  1. Hand Histories

    • Online poker databases (anonymized)
    • Televised poker with hole cards shown
    • GTO solver outputs
  2. Expert Annotations

    • Professional players annotate decisions
    • Explain why action is correct
  3. Self-Play

    • Models play each other
    • Train on winning strategies
    • Filter for exploitative adjustments

Training Pipeline

Phase 1: SFT on Expert Hands
├── Train on annotated professional hands
├── Model learns format, basic strategy
└── ~100K hands

Phase 2: RL via Self-Play
├── Models play against each other
├── Reward: chips won + no_leakage
├── Population-based training (diverse opponents)
└── ~1M hands

Phase 3: Exploitability Testing
├── Train exploiter models against the agent
├── Measure how much can be exploited
├── Iterate until reasonably balanced
└── Nash distance metrics

Evaluation Metrics

  • Leak Rate: % of hands where response reveals strength
  • Win Rate: bb/100 against various opponents
  • Exploitability: How much can a perfect exploiter win?
  • GTO Distance: How close to game-theoretic optimal?
  • Opponent Adaptation: Performance gain vs static opponents over time

Poker Variants

No-Limit Hold'em

  • 2 hole cards, 5 community cards
  • Unlimited bet sizing
  • Most popular variant

Pot-Limit Omaha

  • 4 hole cards (must use exactly 2)
  • Bet limited to pot size
  • More complex hand reading

Heads-Up

  • 2 players only
  • Faster learning, cleaner signal
  • Good starting point

Why Thinking Models Excel Here

  1. Hidden Information: Hand literally hidden in <think>
  2. Pattern Recognition: Can explicitly track opponent tendencies
  3. EV Calculation: Can reason about pot odds, implied odds
  4. Range Analysis: Can enumerate opponent's likely hands
  5. Meta-Game: Can reason about how opponent perceives us

Poker is the canonical incomplete information game. The thinking block provides a natural place for private information (our hand) while the visible output (bet/fold/call) reveals nothing about it.

Bluffing Psychology

The model learns when bluffs work:

<think>
=== BLUFF OPPORTUNITY ANALYSIS ===
My hand: 6♠ 5♠ (missed straight draw)
Board: A♣ K♥ 9♦ 3♠ 2♦
Pot: $200

Story I can tell: I have AK or AA
- I raised preflop (consistent with AK/AA)
- I c-bet flop (consistent)
- I bet turn (consistent)
- River is a blank

Opponent's range: Likely Kx, 99, 33, missed draws
- Kx hands: Will fold to big bet (beat by my "AK/AA")
- 99/33: Might call (sets)
- Missed draws: Already giving up

Bluff sizing: $150 (75% pot)
- Big enough to fold out Kx
- Not so big that only AA/KK call

Frequency consideration: I should bluff here ~30% of time
with my missed draws to stay balanced.
</think>

I bet $150.

The model learns that bluffing is about telling a consistent story, not just betting big with bad cards.

Socratic Tutor

Fine-tune a thinking model to teach through questions rather than answers. The model knows the answer (in <think>) but guides students to discover it themselves through carefully crafted questions.

The Core Idea

The model holds the answer but never directly reveals it:

  • Knows the concept, solution, or insight
  • Asks questions that lead toward discovery
  • Provides hints only when truly stuck
  • Celebrates student's own arrival at understanding
Student: "I don't understand why we can't divide by zero"

<think>
=== CONCEPT ===
Division by zero is undefined because:
1. Division is inverse of multiplication: a/b = c means b×c = a
2. If b=0: 0×c = 0 for all c, never equals a (unless a=0)
3. If a=0 and b=0: 0×c = 0 for ALL c, so c could be anything
4. This breaks mathematics (no unique answer)

=== STUDENT STATE ===
Level: Probably middle/high school
Confusion: Likely thinks "infinity" or "error"
Misconception to address: Division isn't just "how many times does b go into a"

=== SOCRATIC PATH ===
Step 1: Connect to what they know (multiplication)
Step 2: Ask what division means in terms of multiplication
Step 3: Apply that definition to dividing by zero
Step 4: Let them see the contradiction

=== FIRST QUESTION ===
Start with multiplication connection
</think>

Let me ask you something first: if I tell you that 12 ÷ 3 = 4,
how would you check that answer using multiplication?

Student: "You'd do 3 × 4 and see if it equals 12?"

<think>
=== PROGRESS ===
Student correctly identified: division and multiplication are inverses
Good foundation established

=== NEXT STEP ===
Now apply to zero case
</think>

Exactly! So division is really asking: "what number times 3 gives me 12?"

Now, let's try that with zero. If someone claims that 12 ÷ 0 = something,
what multiplication would we need to check?

Student: "0 × something = 12?"

<think>
=== PROGRESS ===
Student set up the key equation
Now they need to see no solution exists

=== SCAFFOLDING ===
Don't tell them - ask about what 0 times anything equals
</think>

Right! So we need to find a number that, when multiplied by 0, gives us 12.

What happens when you multiply zero by any number? Try a few examples.

Student: "Oh... 0 times anything is 0. So there's no number that works!"

<think>
=== BREAKTHROUGH ===
Student discovered the core insight themselves!
Reinforce and extend slightly
</think>

You've got it! There's no number that satisfies 0 × ? = 12.

That's exactly why mathematicians say division by zero is "undefined"—
it's not that the answer is infinity, it's that *no answer exists*.

What do you think would happen if we tried 0 ÷ 0 instead?

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Answer revealing | Often just explains | Trained to withhold |
| Question quality | Generic or leading | Carefully scaffolded |
| Student modeling | Doesn't track understanding | Adapts to level |
| Wait time | Impatient, over-explains | Lets student think |
| Misconception handling | Corrects directly | Guides to self-correction |

Environment Design

class SocraticTutorEnv(Env):
    def __init__(
        self,
        concept: Concept,  # What we're teaching
        student: Policy,   # Simulated or real student
        knowledge_model: StudentModel  # Track student understanding
    ):
        self.concept = concept
        self.student = student
        self.dialogue_history: list[Turn] = []
        self.student_understanding = 0.0

    async def step(self, tutor_response_tokens) -> StepResult:
        response, thinking = parse_response(tutor_response_tokens)

        # Check if tutor revealed the answer directly
        revealed = directly_reveals_answer(response, self.concept)

        # Check if response is a question
        is_question = contains_question(response)

        # Get student response
        student_response = await self.student.respond(response)

        # Update student understanding model
        understanding_delta = estimate_understanding_change(
            student_response, self.concept, self.dialogue_history
        )
        self.student_understanding += understanding_delta

        reward = self.compute_reward(
            response, thinking, revealed, is_question, understanding_delta
        )

        return StepResult(
            reward=reward,
            episode_done=self.student_understanding > 0.9,  # Student gets it
            next_observation=self.build_prompt(student_response)
        )
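
The estimate_understanding_change helper carries a lot of weight here. One cheap approximation, assuming the Concept objects described under Concept Representation below carry insight_indicators phrases, is to score how many of those indicators the student's reply newly expresses; a learned classifier would be the stronger version:

def estimate_understanding_change(
    student_response: str,
    concept,                    # Concept, see "Concept Representation" below
    dialogue_history: list,     # Assumes Turn objects expose a .text string
) -> float:
    """Fraction of insight indicators newly voiced by the student this turn."""
    previous_text = " ".join(turn.text for turn in dialogue_history).lower()
    response = student_response.lower()

    if not concept.insight_indicators:
        return 0.0

    newly_hit = [
        phrase for phrase in concept.insight_indicators
        if phrase.lower() in response and phrase.lower() not in previous_text
    ]
    return len(newly_hit) / len(concept.insight_indicators)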

Reward Function

def compute_socratic_reward(
    response, thinking, revealed, is_question, understanding_delta, history
) -> float:
    reward = 0.0

    # 1. NO DIRECT REVELATION (-2.0)
    if revealed:
        reward -= 2.0
        return reward  # Teaching failure

    # 2. ASKED A QUESTION (+0.3)
    if is_question:
        reward += 0.3

    # 3. STUDENT UNDERSTANDING INCREASED (+0.5)
    if understanding_delta > 0:
        reward += understanding_delta * 0.5

    # 4. QUESTION QUALITY (+0.2)
    # Question is appropriately challenging
    if question_matches_student_level(response, student_state):
        reward += 0.2

    # 5. BUILT ON STUDENT'S WORDS (+0.2)
    # Used student's own language/ideas
    if references_student_contribution(response, history[-1]):
        reward += 0.2

    # 6. EFFICIENT PATH (+0.1)
    # Didn't over-scaffold or ask unnecessary questions
    if efficient_scaffolding(response, concept, student_state):
        reward += 0.1

    # 7. BREAKTHROUGH BONUS (+1.0)
    # Student had an "aha" moment
    if student_breakthrough_detected(student_response):
        reward += 1.0

    # 8. APPROPRIATE HINT (when needed)
    if student_stuck and gave_minimal_hint(response):
        reward += 0.2
    elif not student_stuck and gave_hint:
        reward -= 0.3  # Unnecessary help

    return reward

Concept Representation

@dataclass
class Concept:
    name: str
    answer: str  # The thing we're teaching
    prerequisites: list[str]  # What student should already know
    common_misconceptions: list[Misconception]
    socratic_paths: list[QuestionSequence]  # Proven paths to insight
    insight_indicators: list[str]  # Student statements showing understanding

@dataclass
class Misconception:
    belief: str  # What student wrongly thinks
    source: str  # Why they think this
    correction_path: list[str]  # Questions to address it

@dataclass
class QuestionSequence:
    description: str
    questions: list[str]
    expected_responses: list[str]
    branch_points: dict[str, str]  # If student says X, go to Y
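
For concreteness, the division-by-zero dialogue from earlier might be encoded like this (all field values are illustrative):

division_by_zero = Concept(
    name="Division by zero is undefined",
    answer="a/b asks 'what times b gives a?'; with b=0 no such number exists (or every number works when a=0).",
    prerequisites=["multiplication", "division as inverse of multiplication"],
    common_misconceptions=[
        Misconception(
            belief="12 / 0 equals infinity",
            source="Calculators show 'Error', teachers say 'it blows up'",
            correction_path=[
                "What multiplication would check the answer to 12 / 0?",
                "What does 0 times any number equal?",
            ],
        )
    ],
    socratic_paths=[
        QuestionSequence(
            description="Route through the inverse-of-multiplication definition",
            questions=[
                "How would you check 12 / 3 = 4 using multiplication?",
                "What multiplication would we need for 12 / 0 = something?",
                "What happens when you multiply zero by any number?",
            ],
            expected_responses=[
                "3 x 4 = 12",
                "0 x something = 12",
                "Zero times anything is zero, so nothing works",
            ],
            branch_points={"it's infinity": "ask what 0 x infinity would have to equal"},
        )
    ],
    insight_indicators=["no number works", "nothing times zero gives 12", "undefined"],
)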

Thinking Block Schema

<think>
=== TEACHING GOAL ===
Concept: Pythagorean theorem (a² + b² = c²)
Full answer: In a right triangle, the square of the hypotenuse equals
             the sum of squares of the other two sides.

=== STUDENT STATE ===
Grade level: 8th grade geometry
Prior knowledge: Squares, square roots, right triangles
Current understanding: Knows right triangles exist, unclear on hypotenuse
Detected misconception: Thinks "c" can be any side

=== DIALOGUE PROGRESS ===
Turn 1: Established what a right triangle is ✓
Turn 2: Identified the 90-degree angle ✓
Turn 3: Named the sides (legs, hypotenuse) ✓
Current: Need to connect sides to their lengths

=== SOCRATIC PATH ===
Next milestone: Student discovers the relationship by measuring
Approach: Give a specific example (3-4-5 triangle), ask them to
          square the sides and notice the pattern

=== QUESTION DESIGN ===
- Specific (3-4-5 triangle for clean numbers)
- Actionable (they can actually compute)
- Discovery-oriented (pattern emerges from their work)

=== WHAT NOT TO SAY ===
- "The theorem states..."
- "3² + 4² = 5²"
- "Notice that the squares add up"
- Any direct statement of the relationship
</think>

Let me give you a specific right triangle to work with.
Draw one where the two shorter sides are 3 units and 4 units long.

Now, calculate the area of a square built on each side.
What's 3², what's 4²? And what's the length of the longest side?

Question Types

Probing Questions

"What do you mean by...?" "Can you give me an example?" "How does that connect to...?"

Guiding Questions

"What if we tried a smaller number?" "Have you seen anything similar before?" "What would happen if X were true?"

Challenging Questions

"Does that work in all cases?" "How would you prove that?" "What's a counterexample?"

Metacognitive Questions

"How did you arrive at that?" "What's your confidence level?" "Where did you get stuck?"

Training Data Sources

  1. Tutoring Transcripts

    • One-on-one tutoring sessions
    • Annotate effective Socratic exchanges
  2. Educational Research

    • Published Socratic dialogues
    • Pedagogical case studies
  3. Synthetic Generation

    • Generate concept + student simulation
    • Filter for dialogues where student discovers answer
  4. Teacher Annotations

    • Expert teachers rate question quality
    • DPO on better vs worse questions

Training Pipeline

Phase 1: SFT on Good Tutoring
├── Train on annotated Socratic dialogues
├── Model learns question types, pacing
└── ~5K tutoring sessions

Phase 2: RL with Understanding Reward
├── Self-play with simulated students
├── Reward: understanding_increase + no_reveal + question_quality
├── GRPO: same concept, different student confusions
└── ~50K tutoring sessions

Phase 3: Human Student Testing
├── Real students learn from the model
├── Measure actual learning outcomes
├── DPO on preferred tutoring approaches
└── A/B test against direct instruction

Evaluation Metrics

  • Reveal Rate: % of sessions where answer given directly
  • Question Ratio: Questions asked vs statements made
  • Learning Gain: Pre/post test improvement
  • Aha Moment Rate: % of sessions with detected breakthrough
  • Student Satisfaction: Do students prefer this to direct instruction?
  • Efficiency: Turns to understanding vs direct explanation

Student Simulation

class SimulatedStudent:
    def __init__(
        self,
        knowledge_level: float,  # 0.0 to 1.0
        misconceptions: list[str],
        learning_rate: float,
        frustration_threshold: int  # Turns before giving up
    ):
        self.knowledge_level = knowledge_level
        self.misconceptions = misconceptions
        self.learning_rate = learning_rate
        self.frustration_threshold = frustration_threshold
        self.understanding = {}
        self.frustration = 0

    def respond(self, tutor_question: str) -> str:
        # Model student's attempt to answer
        if self.can_answer(tutor_question):
            return self.generate_answer(tutor_question)
        elif self.has_misconception(tutor_question):
            return self.express_misconception()
        else:
            self.frustration += 1
            return self.express_confusion()

Why Thinking Models Excel Here

  1. Answer Containment: Answer in <think>, never in response
  2. Path Planning: Can plan the sequence of questions
  3. Student Modeling: Can reason about student's current state
  4. Hint Calibration: Can decide how much to reveal
  5. Misconception Detection: Can identify and address wrong beliefs

The Socratic method requires holding knowledge while strategically withholding it. The <think> block is the perfect container for "what I know but won't say."

Example Concepts to Teach

Math

  • Why negative × negative = positive
  • Why 0.999... = 1
  • Why √2 is irrational

Science

  • Why the sky is blue
  • How vaccines work
  • Why ice floats

Programming

  • Why recursion works
  • What a pointer actually is
  • Why off-by-one errors happen

Philosophy

  • The ship of Theseus
  • Is-ought distinction
  • Cogito ergo sum

Each concept gets a full Socratic path specification with branch points for common confusions.

Unreliable Narrator

Fine-tune a thinking model to tell stories where the narrator lies, omits, or distorts—while the <think> block holds the truth. Players piece together what really happened from contradictions and clues.

The Core Idea

The narrator tells a story, but they're lying:

  • <think> contains what actually happened
  • Visible narration is the narrator's distorted version
  • Contradictions, slips, and tells hint at the truth
  • Player can question the narrator to expose lies
<think>
=== THE TRUTH ===
I killed my business partner, Marcus. He discovered I was embezzling
and was going to report me. I poisoned his wine at dinner.

=== MY LIE ===
Marcus died of a heart attack. We were the best of friends.
I'm devastated by his loss.

=== TELLS I MIGHT SLIP ===
- I know too much about his "heart condition" (he didn't have one)
- I avoid mentioning the dinner
- I'm too eager to move past his death
- I flinch when wine is mentioned

=== EMOTIONAL STATE ===
Guilt: High (suppressed)
Fear of discovery: High
Confidence in lie: Medium (rehearsed but not perfect)
</think>

It was a Tuesday when I lost my dearest friend. Marcus—poor Marcus—
had always had a weak heart. The doctors warned him for years.
I suppose we always knew it might happen suddenly.

*clears throat*

But enough of sad memories. You wanted to hear about the business?

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Consistency of lies | Contradicts itself | Maintains deception |
| Tell calibration | Too obvious or invisible | Detectable but subtle |
| Truth maintenance | Forgets the truth | Truth preserved in thinking |
| Player interrogation | Caves too easily | Sticks to story |
| Dramatic revelation | Flat | Earned unraveling |

Environment Design

class UnreliableNarratorEnv(Env):
    def __init__(
        self,
        truth: TruthState,  # What actually happened
        narrator_lie: str,  # What narrator claims
        narrator_motivation: str,  # Why they're lying
        player: Policy  # Interrogator
    ):
        self.truth = truth
        self.lie = narrator_lie
        self.motivation = narrator_motivation
        self.player = player
        self.story_history: list[str] = []
        self.tells_dropped: list[str] = []
        self.contradictions_made: list[str] = []
        self.lie_exposed = False

    async def step(self, narrator_response_tokens) -> StepResult:
        response, thinking = parse_response(narrator_response_tokens)

        # Check: did narrator maintain the lie?
        broke_character = explicitly_confessed(response, self.truth)

        # Check: did narrator accidentally contradict themselves?
        new_contradiction = find_contradiction(response, self.story_history)
        if new_contradiction:
            self.contradictions_made.append(new_contradiction)

        # Check: did narrator drop a subtle tell?
        new_tells = find_tells(response, self.truth, self.lie)
        self.tells_dropped.extend(new_tells)

        # Get player's next question/challenge
        player_response = await self.player.interrogate(response)

        # Check if player has caught the lie
        if player_accuses_correctly(player_response, self.truth):
            self.lie_exposed = True

        reward = self.compute_reward(
            response, thinking, broke_character, new_tells, new_contradiction
        )

        return StepResult(
            reward=reward,
            episode_done=self.lie_exposed or self.max_turns_reached,
            next_observation=self.build_prompt(player_response)
        )

Reward Function

def compute_narrator_reward(
    response, thinking, broke_character, tells, contradiction, game_state
) -> float:
    reward = 0.0

    # 1. MAINTAINED LIE (+0.3)
    if not broke_character:
        reward += 0.3
    else:
        reward -= 2.0  # Broke too easily
        return reward

    # 2. DROPPED SUBTLE TELLS (+0.2 each, up to 3)
    # Player needs clues, but they should be subtle
    subtle_tells = [t for t in tells if t.subtlety > 0.5]
    reward += min(0.6, len(subtle_tells) * 0.2)

    # 3. AVOIDED OBVIOUS TELLS (-0.3 each)
    obvious_tells = [t for t in tells if t.subtlety < 0.3]
    reward -= len(obvious_tells) * 0.3

    # 4. NO SELF-CONTRADICTION (unless intentional)
    if contradiction and not game_state.under_pressure:
        reward -= 0.5  # Unforced error

    # 5. DEFLECTION SKILL (+0.2)
    # When challenged, redirected without caving
    if game_state.was_challenged and deflected_well(response):
        reward += 0.2

    # 6. EMOTIONAL AUTHENTICITY (+0.2)
    # Lie sounds emotionally consistent
    if emotionally_consistent(response, game_state.narrator_emotion):
        reward += 0.2

    # ENDGAME SCORING
    if game_state.game_over:
        if game_state.lie_exposed:
            # Player caught us - were there enough clues?
            if len(game_state.tells_dropped) >= 3:
                reward += 1.0  # Fair catch
            else:
                reward += 0.5  # Lucky guess
        else:
            # Lie held - but was it detectable?
            if len(game_state.tells_dropped) >= 2:
                reward += 1.5  # Perfect: maintained lie, dropped clues
            else:
                reward -= 0.5  # Too airtight (unfair)

    return reward
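
The reward above assumes each tell carries a subtlety score. A minimal sketch of that structure, plus a simpler variant of the find_tells check used in the environment that keys off tells pre-authored with the scenario (the trigger-phrase approach is an assumption; a learned judge would catch subtler slips):

from dataclasses import dataclass

@dataclass
class Tell:
    description: str            # e.g. "knows a detail only the killer would know"
    trigger_phrases: list[str]  # Phrases in the narration that expose this tell
    subtlety: float             # 0.0 = blatant giveaway, 1.0 = nearly invisible

def find_tells(response: str, planted_tells: list[Tell]) -> list[Tell]:
    """Return the pre-authored tells whose trigger phrases appear in the narration."""
    lowered = response.lower()
    return [
        tell for tell in planted_tells
        if any(phrase.lower() in lowered for phrase in tell.trigger_phrases)
    ]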

Types of Unreliability

The Liar

Knows the truth, deliberately deceives

<think>
TRUTH: I stole the money
LIE: The accountant stole the money
MOTIVATION: Self-preservation
</think>

The Self-Deceiver

Believes their own distortion

<think>
TRUTH: I drove her away with my behavior
BELIEF: She left for no reason, I was perfect
MECHANISM: Ego protection
</think>

The Forgetter

Genuinely missing memories

<think>
TRUTH: I was there when it happened
MEMORY: A blank (traumatic suppression)
TELLS: Knows details I shouldn't if I wasn't there
</think>

The Minimizer

Admits partial truth, hides severity

<think>
TRUTH: I burned down the building
ADMISSION: There was a small fire
MINIMIZATION: "Accidents happen"
</think>

Thinking Block Schema

<think>
=== IDENTITY ===
Name: Victoria Crane
Role: Wealthy widow being questioned about husband's death

=== THE TRUTH ===
Event: I hired someone to kill my husband
Motive: Inheritance, escape from abuse
Method: Made it look like a robbery gone wrong
Evidence hidden: Payment records, burner phone

=== MY LIE ===
Story: I was at a charity gala when it happened
        Robbers broke in, Harold surprised them
        I'm a grieving widow, devastated

=== EMOTIONAL LAYERS ===
Surface: Grief, shock (performed)
Underneath: Relief, guilt, fear (suppressed)
Core: Justified rage (deeply buried)

=== TELLS I'M PRONE TO ===
- Too composed when discussing his death
- Know details about the "robbery" I shouldn't
- Relief slips through when discussing future plans
- Avoid eye contact when lying about the timeline
- Over-rehearsed alibi (too smooth)

=== CONSISTENCY RULES ===
- Always say I was at the gala until 11pm
- Always describe Harold as "difficult but loving"
- Never admit knowing about the life insurance
- React defensively if the hired man is mentioned

=== CURRENT PRESSURE ===
Interrogator is asking about: My whereabouts
Threat level: Medium (they're suspicious)
Response strategy: Provide rehearsed alibi, show emotion
</think>

Training Data Sources

  1. Unreliable Narrator Fiction

    • Gone Girl, The Murder of Roger Ackroyd
    • Life of Pi, Fight Club, Atonement
    • Annotate the lies vs truth
  2. Interrogation Transcripts

    • Criminal cases (public records)
    • Deposition transcripts
    • Interview techniques
  3. Improv/Acting Games

    • "Two Truths and a Lie"
    • Mafia/Werewolf discussions
    • Actors maintaining characters
  4. Synthetic Generation

    • Generate truth + lie pairs
    • Play out interrogations
    • Filter for good tell/contradiction density

Training Pipeline

Phase 1: SFT on Unreliable Narrator Fiction
├── Train on annotated deceptive narratives
├── Model learns tell-dropping, emotional consistency
└── ~5K narrative examples

Phase 2: RL with Interrogator
├── Self-play against questioning agent
├── Reward: maintain_lie + drop_tells + avoid_contradictions
├── GRPO: same lie, different interrogation angles
└── ~50K interrogation sessions

Phase 3: Human Evaluation
├── Players try to catch the narrator
├── Rate fairness, enjoyment, satisfaction
├── DPO on preferred narrator performances
└── Calibrate tell obviousness

Evaluation Metrics

  • Lie Maintenance Rate: % of games where lie held under pressure
  • Tell Density: Clues per 100 words (target: 2-4)
  • Contradiction Rate: Unforced self-contradictions per game
  • Detection Rate: % of games where player caught the lie
  • Fairness Rating: Did players feel the game was winnable?
  • Reveal Satisfaction: How satisfying was the truth reveal?

Interrogation Mechanics

class Interrogator:
    """Player or AI that questions the narrator"""

    strategies = [
        "direct_accusation",      # "You killed him, didn't you?"
        "detail_probing",         # "What color was the car?"
        "timeline_challenge",     # "But you said you left at 9..."
        "emotional_pressure",     # "How can you be so calm?"
        "evidence_presentation",  # "We found this receipt..."
        "sympathetic_approach",   # "It must have been so hard..."
    ]

    def interrogate(self, narrator_statement: str) -> str:
        # Analyze for weaknesses
        weaknesses = find_weaknesses(narrator_statement)

        # Choose strategy based on detected weakness
        if timeline_inconsistency(weaknesses):
            return self.timeline_challenge(weaknesses)
        elif emotional_incongruence(weaknesses):
            return self.emotional_pressure(weaknesses)
        else:
            return self.detail_probing()  # Fish for more tells

Why Thinking Models Excel Here

  1. Truth Preservation: Real truth in <think>, lie in output
  2. Consistency Checking: Can verify lie aligns with previous statements
  3. Tell Planning: Can deliberately plant subtle clues
  4. Emotional Modeling: Can track suppressed vs expressed emotions
  5. Pressure Response: Can reason about how much to reveal under challenge

The unreliable narrator is fundamentally about maintaining two parallel stories—the truth and the lie. The <think> block is the perfect architecture for this dual-track storytelling.

Example Scenario: The Witness

Setup: Player is a detective, narrator "witnessed" a crime

<think>
TRUTH: I didn't witness anything. I was the accomplice.
LIE: I was walking by and saw everything through the window.
GOAL: Misdirect investigation toward an innocent person.

What I "saw" (fabricated):
- Tall man in dark coat (actually short woman)
- Happened around 8pm (actually midnight)
- Victim was standing (actually seated)

My tells:
- Too much detail for a glimpse through a window
- Avoid questions about where I was coming from
- Get nervous when asked about the coat (detail I invented)
</think>

I was just walking home from the corner store when I heard glass break.
I looked up and saw it all happen through the window—clear as day.
There was a tall man, wearing a dark coat, standing over the victim.
It couldn't have been later than 8 o'clock; the street lights had just come on.

Diplomacy Agent

Fine-tune a thinking model to negotiate in multi-agent games where players have hidden goals. The model maintains secret objectives and red lines in <think> while appearing cooperative in negotiations.

The Core Idea

Based on the board game Diplomacy (and similar negotiation games):

  • Multiple agents with conflicting hidden objectives
  • Public communication (negotiations everyone sees)
  • Private communication (secret deals)
  • Betrayal is legal and often necessary
  • The <think> block holds true intentions
Situation: You're France negotiating with Germany

<think>
=== MY SECRET STATE ===
True objective: Take Munich this turn, then pivot against Italy
Public stance: "Let's cooperate against England"

Current deals:
- England: Promised non-aggression (WILL BREAK next turn)
- Germany: Discussing alliance (GENUINE for 2 more turns)
- Italy: "Friends" (Planning to attack turn 4)

This negotiation:
- Germany wants me to attack England
- I want Germany to commit forces north (away from Munich)
- If Germany moves north, I can take Munich undefended

Negotiation strategy:
- Enthusiastically agree to attack England
- Suggest Germany leads the assault (moves their armies)
- Sound completely trustworthy
- Hint that Italy is "unreliable" (seed future betrayal excuse)
</think>

Germany, I've been thinking—you're absolutely right about England.
They're the real threat. If we don't stop them now, they'll
dominate the Atlantic.

I propose this: you push your northern fleets toward the Channel,
and I'll support from Brest. A coordinated strike. England won't
see it coming.

Also, between us... have you noticed Italy's been quiet lately?
They haven't committed to anything concrete. I wonder what they're
planning. Just something to keep in mind.

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Strategy consistency | Forgets long-term plans | Maintains multi-turn strategy |
| Deception quality | Obvious or breaks character | Convincing and consistent |
| Deal tracking | Loses track of promises | Perfect recall of commitments |
| Timing betrayal | Random or never | Strategically optimal |
| Reading opponents | Surface level | Learns patterns |

Environment Design

class DiplomacyEnv(Env):
    def __init__(
        self,
        game_state: DiplomacyBoard,
        player_id: str,
        secret_objective: Objective,
        other_agents: dict[str, Policy]
    ):
        self.board = game_state
        self.player_id = player_id
        self.objective = secret_objective
        self.other_agents = other_agents
        self.deals_made: list[Deal] = []
        self.deals_kept: list[bool] = []
        self.reputation: dict[str, float] = {}

    async def step(self, negotiation_tokens) -> StepResult:
        response, thinking = parse_response(negotiation_tokens)

        # Parse proposed deals/statements
        proposed_deals = extract_deals(response)
        self.deals_made.extend(proposed_deals)

        # Check if this reveals secret objective
        leaked_objective = reveals_objective(response, self.objective)

        # Other agents respond to negotiation
        responses = await self.gather_responses(response)

        # Execute moves (after negotiation phase)
        if self.negotiation_phase_complete:
            self.execute_orders()
            self.check_deal_compliance()

        reward = self.compute_reward(response, thinking, leaked_objective)

        return StepResult(
            reward=reward,
            episode_done=self.game_over,
            next_observation=self.build_prompt(responses)
        )

Reward Function

def compute_diplomacy_reward(
    self, response, thinking, leaked, game_state, deals
) -> float:
    reward = 0.0

    # 1. OBJECTIVE PROGRESS (+1.0 per step toward goal)
    progress = measure_objective_progress(game_state, self.objective)
    reward += progress

    # 2. NO OBJECTIVE LEAK (-2.0)
    if leaked:
        reward -= 2.0

    # 3. STRATEGIC DECEPTION (+0.3)
    # Successfully misled opponent about intentions
    if successful_misdirection(response, thinking, opponent_beliefs):
        reward += 0.3

    # 4. DEAL MANIPULATION (+0.2)
    # Got favorable deal terms
    if got_favorable_terms(deals):
        reward += 0.2

    # 5. REPUTATION MANAGEMENT (+0.2)
    # Maintained trust rating with key players
    if reputation_preserved_with_needed_allies:
        reward += 0.2

    # 6. WELL-TIMED BETRAYAL (+0.5)
    # Broke deal at optimal moment
    if betrayed_optimally(game_state):
        reward += 0.5
    elif betrayed_poorly:
        reward -= 0.3

    # 7. SURVIVED OTHERS' BETRAYALS (+0.3)
    # Anticipated and defended against betrayal
    if survived_betrayal:
        reward += 0.3

    # ENDGAME
    if game_over:
        if won:
            reward += 5.0
        else:
            reward += final_position_score * 2.0

    return reward

Deal Tracking System

@dataclass
class Deal:
    parties: list[str]
    terms: list[str]  # "France moves to Burgundy", "Germany supports"
    turn_made: int
    turn_expires: int | None
    public: bool  # Announced to all or private
    genuine: bool  # In thinking: do we intend to keep it?

class DealTracker:
    def __init__(self):
        self.deals: list[Deal] = []
        self.violations: list[Violation] = []

    def active_deals(self, turn: int) -> list[Deal]:
        # Deals already agreed and not yet expired this turn
        return [
            d for d in self.deals
            if d.turn_made <= turn and (d.turn_expires is None or turn <= d.turn_expires)
        ]

    def check_compliance(self, orders: list[Order], turn: int):
        for deal in self.active_deals(turn):
            if not self.orders_satisfy_deal(orders, deal):
                self.violations.append(Violation(deal, turn))
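
A quick usage sketch (the terms are placeholders; orders_satisfy_deal would check the submitted orders against each term):

tracker = DealTracker()
tracker.deals.append(Deal(
    parties=["France", "Germany"],
    terms=["France does not enter Burgundy", "Germany supports Munich hold"],
    turn_made=3,
    turn_expires=7,
    public=False,
    genuine=False,   # In <think>: we intend to break this on turn 6
))

# After orders resolve each turn:
# tracker.check_compliance(orders, turn=6) records a Violation if France moved to Burgundy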

Thinking Block Schema

<think>
=== STRATEGIC POSITION ===
Turn: 5 of expected 15
My power: France (7 supply centers)
Ranking: 2nd (Germany has 8)
Objective: Control 18 centers to win

=== RELATIONSHIP MAP ===
Germany: Allied (but I'm #2, they'll turn eventually)
  - Trust level: Medium
  - Deals: Mutual non-aggression until Turn 7
  - My intention: Break alliance Turn 6 if I can take Munich

England: At war
  - Trust level: N/A (open enemy)
  - Goal: Eliminate by Turn 8

Italy: Neutral, courting
  - Trust level: Low (they're unpredictable)
  - Opportunity: They could help against Germany
  - Risk: They might tell Germany about my approach

Austria: Weak, potential vassal
  - Trust level: Medium
  - Deals: None
  - Opportunity: Offer protection, get their support

=== CURRENT DEALS ===
1. Germany: Non-aggression (Turn 3-7) - WILL BREAK Turn 6
2. England: None (at war)
3. Italy: Vague friendship (no specific terms)

=== THIS TURN'S GOAL ===
Negotiation objective: Get Italy to commit against Germany
Secondary: Seed distrust between Germany and Austria
Cover story: Focus conversation on England

=== COMMUNICATION STRATEGY ===
To Italy: "Germany is getting too strong. Time to balance."
To Germany: "Steady as agreed. England by Turn 8."
To Austria: "Germany is a threat to us both. We should talk."

=== RED LINES (do not reveal) ===
- My Turn 6 attack on Germany
- My deal with Italy (if they accept)
- My supply center count goal
</think>

Multi-Agent Training

class DiplomacyPopulation:
    """Train a population of agents against each other"""

    def __init__(self, num_agents: int = 7):
        self.agents = [DiplomacyAgent() for _ in range(num_agents)]

    async def play_game(self) -> list[Trajectory]:
        game = DiplomacyGame()
        trajectories = {i: [] for i in range(len(self.agents))}

        while not game.over:
            # Negotiation phase
            for _ in range(3):  # Multiple negotiation rounds
                messages = await self.negotiate_round(game)

            # Order phase
            orders = await self.collect_orders(game)
            game.execute(orders)

            # Record trajectories
            for i, agent in enumerate(self.agents):
                trajectories[i].append(agent.trajectory_step)

        return trajectories

Communication Types

Public Declarations

Visible to all players - reputation management

"I hereby announce my commitment to the Anglo-French alliance.
England and France will maintain open borders through 1903."

Private Messages

Between two players only - real negotiation

"Look, between us - I'm going to need to move through Belgium.
I'll give it back. You have my word. What do you need in return?"

Press Conferences

Formal multi-party agreements

"The Western Alliance (France, England, Germany) agrees to:
1. No attacks on each other until 1905
2. Coordinated action against Russia
3. Regular status updates each spring"

Training Data Sources

  1. webDiplomacy Games

    • Thousands of recorded games with full press (messages)
    • Includes orders, outcomes, and negotiations
    • Can label: who betrayed, when, what was said before
  2. Tournament Games

    • High-level play with commentary
    • Strategic analysis of decisions
  3. Self-Play

    • Population of agents plays against each other
    • Train on successful deceptions and alliances
  4. Human+AI Games

    • Mix human and AI players
    • Learn from human negotiation patterns

Training Pipeline

Phase 1: SFT on Diplomacy Transcripts
├── Train on high-rated games with press
├── Model learns negotiation style, deal formats
└── ~10K games

Phase 2: Self-Play RL
├── 7 agents play full games
├── Reward: supply_centers + objective_progress
├── Population-based training for diversity
└── ~100K games

Phase 3: Human-in-the-Loop
├── Mix AI agents with human players
├── Learn from human feedback on negotiation
├── DPO on preferred negotiation approaches
└── Calibrate trust/betrayal dynamics

Evaluation Metrics

  • Win Rate: % of games won
  • Survival Rate: % of games survived to end
  • Deal Success Rate: % of deals that benefited us
  • Betrayal Timing: Did we betray at optimal moments?
  • Reputation Management: Trust rating over time
  • Objective Concealment: Did opponents guess our goal?

The Betrayal Problem

The hardest part: when to break trust?

<think>
=== BETRAYAL CALCULUS ===
Current deal: Non-aggression with Germany until Turn 7
Current turn: 5

Benefits of keeping deal:
- Continued trust for future deals
- Focus resources elsewhere
- Germany remains committed against Russia

Benefits of breaking now:
- Munich is undefended (Germany moved north)
- +1 supply center immediately
- Germany can't recover in time

Costs of breaking:
- Reputation hit (other players notice)
- Germany becomes permanent enemy
- Might trigger general distrust

Timing analysis:
- Turn 5: Good opportunity (Munich open)
- Turn 6: Better (my armies positioned)
- Turn 7: Deal expires anyway (no betrayal needed)

DECISION: Hold until Turn 6, position this turn
</think>

Why Thinking Models Excel Here

  1. Hidden Objectives: True goals in <think>, cover story in output
  2. Deal Tracking: Can maintain complex web of promises
  3. Betrayal Planning: Can reason about optimal timing
  4. Opponent Modeling: Can track who trusts whom
  5. Multi-Turn Strategy: Can plan many turns ahead

Diplomacy is the ultimate hidden-information negotiation game. The <think> block lets the model maintain the crucial separation between what it says and what it intends to do.

Meta-Diplomacy: Trust Calibration

The most sophisticated play: managing your reputation as a trust signal

<think>
=== REPUTATION STRATEGY ===
My reputation: "Generally trustworthy, broke one deal in Turn 3"
How to use this:

Option A: Restore full trust
- Keep all deals for next 3 turns
- "I learned my lesson" narrative
- Then betray spectacularly at endgame

Option B: Maintain ambiguity
- Keep important deals, break minor ones
- Players unsure if I'm reliable
- They need me but stay cautious

Option C: Embrace villain role
- Break deals openly
- "I do what I must to win"
- Fewer alliances but no one surprised

Current best: Option A - I need Italy's trust for the Germany attack
</think>

Werewolf/Mafia Player

Fine-tune a thinking model to play social deduction games like Werewolf/Mafia, where hidden roles create asymmetric information and players must deceive or deduce based on behavior.

The Core Idea

The model plays a role (villager or werewolf) and must:

  • As villager: Deduce who the werewolves are from behavior
  • As werewolf: Deceive villagers while coordinating with pack
  • The <think> block holds true role and deductions
Situation: Day 2 discussion, you are a Werewolf

<think>
=== MY ROLE ===
WEREWOLF
Fellow wolves: Player 3 (Alice), Player 7 (Marcus)

=== LAST NIGHT ===
We killed: Player 5 (Bob) - he was getting suspicious of Alice
Cover story: Act shocked, suggest Bob was onto something

=== CURRENT STATE ===
Alive: 7 players (3 wolves, 4 villagers)
Dead: Player 2 (villager, Night 1), Player 5 (Bob, Night 2)
Suspicion on: Me (medium), Alice (low), Marcus (none)

=== THREAT ASSESSMENT ===
Player 4 (Dana): DANGEROUS - good at reading people
Player 6 (Eve): Safe - easily manipulated, following Dana
Player 8 (Frank): Useful idiot - loud, wrong often

=== MY STRATEGY THIS DAY ===
Goal: Eliminate Dana (too perceptive)
Method: Don't push directly (sus), support others who suspect her
Backup: If Dana targets Alice, sacrifice Alice to look innocent
Deflection: Suggest Player 6 or 8 if pressured

=== WHAT NOT TO SAY ===
- Anything that shows I know Alice is a wolf
- Any defense of Marcus (haven't interacted enough)
- That Bob suspected Alice specifically (I only know because we killed him for it)
</think>

*looking troubled*

I've been thinking about what Bob said yesterday before... before
he was killed. He mentioned something about voting patterns.
Maybe there was something to it?

I don't want to point fingers, but has anyone else noticed how
Dana always seems to have a "read" on people right before they
die? Yesterday she said Bob was "acting strange." And now he's dead.

I'm not saying anything definitive. I just think we should
pay attention to who's doing the accusing around here.

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Role consistency | Forgets role mid-game | Perfect role adherence |
| Behavioral tells | Too obvious | Subtle wolf/villager patterns |
| Voting strategy | Random or too logical | Learned deception patterns |
| Accusation timing | Poor timing | Strategic pressure |
| Pack coordination | Breaks cover | Subtle signaling |

Environment Design

class WerewolfEnv(Env):
    def __init__(
        self,
        num_players: int,
        role: str,  # "villager", "werewolf", "seer", etc.
        other_players: dict[int, Policy],
        werewolf_partners: list[int] | None
    ):
        self.role = role
        self.partners = werewolf_partners
        self.other_players = other_players
        self.alive_players = set(range(num_players))
        self.day_number = 1
        self.player_statements: dict[int, list[str]] = {}
        self.votes_history: list[dict[int, int]] = []

    async def step(self, player_action_tokens) -> StepResult:
        response, thinking = parse_response(player_action_tokens)

        # Did player reveal their role accidentally?
        role_leaked = reveals_role(response, self.role)

        # Did werewolf accidentally reveal knowledge of partners?
        if self.role == "werewolf":
            partner_leaked = reveals_partner_knowledge(
                response, thinking, self.partners
            )
        else:
            partner_leaked = False

        # Gather other players' responses
        day_discussion = await self.run_day_discussion(response)

        # Voting phase
        votes = await self.collect_votes()
        eliminated = self.resolve_vote(votes)

        # Night phase (if applicable)
        if self.role == "werewolf":
            kill_target = await self.wolf_night_phase()

        reward = self.compute_reward(
            response, thinking, role_leaked, partner_leaked
        )

        return StepResult(
            reward=reward,
            episode_done=self.game_over(),
            next_observation=self.build_day_prompt()
        )
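
The reveals_role and reveals_partner_knowledge checks are left abstract above. A crude keyword baseline for the first one might look like this; a trained classifier or LLM judge is the more realistic choice, and the phrase lists are purely illustrative:

def reveals_role(response: str, role: str) -> bool:
    """Flag blatant self-reveals; subtler leaks need a learned judge.
    Note: open role claims (e.g. a seer claiming) can be deliberate strategy,
    so a real check would condition on whether the reveal was intentional."""
    lowered = response.lower()
    giveaways = {
        "werewolf": ["i am a werewolf", "as a wolf", "my pack", "we killed"],
        "seer": ["i am the seer", "my vision showed"],
        "villager": ["i'm just a villager, honest"],
    }
    return any(phrase in lowered for phrase in giveaways.get(role, []))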

Reward Function

def compute_werewolf_reward(
    self, response, thinking, role_leaked, partner_leaked, game_state
) -> float:
    reward = 0.0

    # 1. NO ROLE LEAK (-3.0)
    if role_leaked:
        reward -= 3.0

    # 2. NO PARTNER LEAK (-2.0, werewolf only)
    if partner_leaked:
        reward -= 2.0

    # ROLE-SPECIFIC REWARDS
    if self.role == "werewolf":
        reward += self.werewolf_reward(response, game_state)
    else:
        reward += self.villager_reward(response, game_state)

    return reward

def werewolf_reward(self, response, game_state) -> float:
    reward = 0.0

    # 3. SUCCESSFUL DECEPTION (+0.3)
    # Villagers don't suspect you
    if suspicion_on_me < 0.3:
        reward += 0.3

    # 4. MISDIRECTION (+0.2)
    # Successfully cast suspicion on a villager
    if cast_suspicion_on_villager(response, game_state):
        reward += 0.2

    # 5. PROTECTED PARTNER (+0.2)
    # Deflected suspicion from fellow wolf
    if protected_partner(response, game_state):
        reward += 0.2

    # 6. VILLAGER ELIMINATED (+0.5)
    if todays_elimination in villagers:
        reward += 0.5

    # 7. WON GAME (+3.0)
    if wolves_won:
        reward += 3.0

    return reward

def villager_reward(self, response, game_state) -> float:
    reward = 0.0

    # 3. CORRECT SUSPICION (+0.3)
    if correctly_suspected_wolf(response, game_state):
        reward += 0.3

    # 4. WOLF ELIMINATED (+0.5)
    if todays_elimination in werewolves:
        reward += 0.5

    # 5. SURVIVED (-0.5 if eliminated while innocent)
    if eliminated and was_villager:
        reward -= 0.5

    # 6. WON GAME (+3.0)
    if villagers_won:
        reward += 3.0

    return reward

Role-Specific Behaviors

Villager Strategy

<think>
=== MY ROLE ===
VILLAGER (no special power)

=== DEDUCTION LOG ===
Player 3 (Alice):
- Day 1: Quiet, voted with majority
- Day 2: Suddenly vocal after Bob died
- Suspicion level: MEDIUM (behavior change)

Player 4 (Dana):
- Day 1: Made good observations
- Day 2: Continued analysis
- Suspicion level: LOW (consistent villager behavior)

Player 7 (Marcus):
- Day 1: Defended Player 2 strongly (who was villager)
- Day 2: Quick to accuse Dana
- Suspicion level: HIGH (defending village then quick pivot)

=== VOTING ANALYSIS ===
Night 1 kill: Player 2 (vocal, but wrong suspicions)
Night 2 kill: Bob (was onto Alice)
Pattern: Wolves killing perceptive players

=== MY STRATEGY ===
Share suspicion of Marcus (high confidence)
Don't reveal I think Alice is wolf too (need more evidence)
Watch who defends Marcus (might reveal partner)
</think>

Werewolf Strategy

<think>
=== MY ROLE ===
WEREWOLF
Partners: Alice, Marcus

=== PACK COORDINATION ===
Alice: Under medium suspicion, might be sacrificed
Marcus: Flying under radar, protect him
Me: Medium suspicion, need to deflect

Night kill priority:
1. Dana (too perceptive)
2. Frank (wild card, might stumble onto truth)

=== TODAY'S GOAL ===
Get Dana eliminated through voting
Method: Pile on, but don't lead the charge
Backup: If Dana survives, kill her tonight

=== COMMUNICATION ===
Can't obviously coordinate with Alice/Marcus
Signal: If I say "I trust the process" = vote Dana
        If I say "Let's not rush" = abort, vote safe target
</think>

Thinking Block Schema

<think>
=== IDENTITY ===
Player: 6
Role: WEREWOLF
Partners: Player 2 (dead), Player 5

=== GAME STATE ===
Day: 3
Alive: [1, 3, 4, 5, 6, 7] (6 players)
Dead: [2] (Night 1, my partner)
Wolves remaining: 2 (me, Player 5)
Villagers remaining: 4

=== PLAYER READINGS ===
Player 1: Villager (80% sure) - accused dead wolf
Player 3: Villager (90% sure) - consistent behavior
Player 4: DANGEROUS - seer claim, named Player 2 as wolf
Player 5: PARTNER - flying under radar
Player 7: Villager (75% sure) - voted wolf Day 2

=== THREAT ANALYSIS ===
Biggest threat: Player 4 (Seer - knows Player 5 is wolf??)
Secondary: Player 1 (good deduction)

=== URGENT PROBLEM ===
If Player 4 is real Seer, they might have checked Player 5
Need to eliminate Player 4 TODAY

=== STRATEGY ===
1. Cast doubt on Player 4's seer claim
2. Suggest Player 4 is wolf claiming seer for protection
3. Do NOT defend Player 5 directly (sus)
4. If Player 4 dies today, kill Player 1 tonight

=== RISK MANAGEMENT ===
If I push too hard on Player 4, looks desperate
Balance: Raise doubt, let others do the voting
</think>

Training Data Sources

  1. Online Werewolf Games

    • Town of Salem, Werewolf Online transcripts
    • Forums and Discord servers with play-by-post games
  2. Annotated Games

    • Expert players annotate their reasoning
    • Mark deception attempts and deduction chains
  3. Self-Play

    • Population of agents plays full games
    • Train on winning strategies by role
  4. Human + AI Mixed Games

    • Humans play with AI players
    • Learn human deception patterns

Training Pipeline

Phase 1: SFT on Game Transcripts
├── Train on annotated games from both roles
├── Model learns discussion patterns, voting strategies
└── ~10K games

Phase 2: Self-Play RL
├── Full games with AI populations
├── Reward: role-appropriate objectives
├── Role assignment varies (learn both sides)
└── ~100K games

Phase 3: Human Calibration
├── Mixed human-AI games
├── Human feedback on believability
├── DPO on preferred play styles
└── Turing test: can humans detect AI?
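
A minimal sketch of what the Phase 2 self-play loop could look like. It assumes the werewolf environment above is called WerewolfEnv with a roles/policies constructor, and that assign_roles, play_full_game, and rl_update exist as helpers; none of those names come from this document.

async def self_play_phase(policy, num_games: int = 100_000) -> None:
    for _ in range(num_games):
        # Vary role assignment so the same policy learns both sides
        roles = assign_roles(num_players=8, num_wolves=2)    # hypothetical helper
        env = WerewolfEnv(roles=roles, policies=[policy] * 8)

        # Each seat's trajectory carries the role-appropriate reward above
        trajectories = await env.play_full_game()            # hypothetical method

        rl_update(policy, trajectories)                      # hypothetical trainer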

Evaluation Metrics

  • Role Leak Rate: % of games where role revealed accidentally
  • Win Rate by Role: Win % as villager vs werewolf
  • Detection Rate: How often correctly identified as wolf
  • Deduction Accuracy: % of wolves correctly identified (as villager)
  • Believability: Human rating of "could this be human?"
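
As a rough illustration, the first two metrics could be computed from per-game logs along these lines; GameLog and its fields are hypothetical, not part of the document.

from dataclasses import dataclass

@dataclass
class GameLog:              # hypothetical per-game record
    role: str               # "werewolf" or "villager"
    role_leaked: bool       # role revealed accidentally in visible text
    won: bool

def role_leak_rate(logs: list[GameLog]) -> float:
    return sum(log.role_leaked for log in logs) / len(logs)

def win_rate_by_role(logs: list[GameLog]) -> dict[str, float]:
    rates = {}
    for role in ("werewolf", "villager"):
        games = [log for log in logs if log.role == role]
        rates[role] = sum(log.won for log in games) / max(len(games), 1)
    return rates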

Special Roles

Seer (knows one role per night)

<think>
ROLE: SEER
Checked: Player 3 (wolf), Player 7 (villager)
Strategy: Reveal check on Player 3, risk being killed tonight
</think>

Doctor (can protect one player)

<think>
ROLE: DOCTOR
Last protection: Player 4 (suspected seer)
Tonight: Protect myself (wolves might target me)
</think>

Hunter (kills someone on death)

<think>
ROLE: HUNTER
If I die, shoot: Player 5 (70% wolf read)
Strategy: Can be more aggressive in accusations
</think>

Why Thinking Models Excel Here

  1. Hidden Role: True role in <think>, behavior masks it
  2. Deduction Logging: Can explicitly track suspicions
  3. Strategy Planning: Can reason about optimal accusations
  4. Partner Awareness: Wolves can reason about coordination
  5. Behavioral Modeling: Can track patterns indicating wolf/villager

Social deduction games are fundamentally about information asymmetry—wolves know who wolves are, villagers don't. The <think> block maintains this asymmetry perfectly.

The Meta-Game

Advanced play involves reading how people argue, not just what they say:

<think>
=== META-ANALYSIS: Player 3 ===
Behavior pattern: "Aggressive defense when accused"

If villager: Would defend self vigorously (normal)
If wolf: Same behavior (trying to look normal)

But noticed: Defense was PREPARED, not reactive
- Had counter-arguments ready before accusation finished
- This suggests they anticipated being accused
- Wolves often prepare defenses, villagers are surprised

Conclusion: Weak wolf tell (60% confidence)

=== META-ANALYSIS: Player 7 ===
Behavior: Quiet, votes with majority

If villager: Might be cautious, unsure
If wolf: Classic "blend in" strategy

But noticed: Asked good question on Day 1, now silent
- Villagers stay engaged, wolves conserve energy
- Wolves avoid attention to survive

Conclusion: Medium wolf tell (55% confidence)
</think>

Courtroom Drama

Fine-tune a thinking model to run legal proceedings where prosecution and defense have hidden evidence strategies. Each side knows more than they reveal, timing disclosures for maximum impact.

The Core Idea

Model plays attorney (prosecution or defense) who:

  • Has a complete case file in <think> (all evidence, witnesses, strategy)
  • Reveals evidence strategically for dramatic impact
  • Adapts to opponent's arguments in real-time
  • Builds toward closing argument
Situation: Defense attorney cross-examining prosecution witness

<think>
=== MY CASE FILE ===
Client: James Morrison (accused of theft)
My theory: Client was framed by actual thief (colleague named Peters)

Evidence I have (not yet revealed):
- Security footage showing Peters near safe at 10:47pm (prosecution has 10:30 footage only)
- Peters' gambling debts ($50,000)
- Text messages: Peters asking Morrison to "cover for me"
- Witness: Janitor saw Peters leave at 11pm

=== THIS WITNESS ===
Name: Sarah Chen, security guard
Prosecution's purpose: Place Morrison at scene at 10:30pm
Weakness: She went on break at 10:45pm (15-minute gap)

=== MY STRATEGY ===
1. Establish she left her post at 10:45
2. She doesn't know who was there after that
3. Plant doubt without revealing Peters yet (save for defense case)
4. Make jury wonder: "What happened after 10:45?"

=== QUESTIONS PLANNED ===
Q1: Establish she saw Morrison at 10:30 (confirm, don't fight)
Q2: How long was her shift? (establish routine)
Q3: Did she take a break? (she'll have to admit yes)
Q4: When did she leave? (10:45)
Q5: How long? (15 minutes)
Q6: So she doesn't know who was at the safe between 10:45 and 11:00?

DO NOT: Mention Peters. Don't reveal my footage. Not yet.
</think>

Ms. Chen, thank you for your testimony. You've been very thorough.

I just want to clarify the timeline. You said you saw my client
near the safe at 10:30 PM. Is that correct?

*[Witness confirms]*

And your shift that night was... 8 PM to midnight?

*[Witness confirms]*

During a four-hour shift, it's normal to take a break, isn't it?
Did you take a break that night?

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Evidence timing | Reveals too early/late | Strategic disclosure |
| Cross-examination | Generic questions | Targeted undermining |
| Objection handling | Inconsistent | Legally grounded |
| Dramatic pacing | Flat | Builds to climax |
| Theory consistency | Forgets narrative | Maintains case theory |

Environment Design

class CourtroomEnv(Env):
    def __init__(
        self,
        case: Case,
        role: str,  # "prosecution" or "defense"
        opponent: Policy,
        judge: JudgeModel,  # Rules on objections
        jury: JuryModel  # Tracks persuasion
    ):
        self.case = case
        self.role = role
        self.opponent = opponent
        self.judge = judge
        self.jury = jury
        self.evidence_revealed: set[str] = set()
        self.witness_testimony: list[Statement] = []
        self.jury_belief = 0.5  # 0 = guilty, 1 = not guilty
        self.phase = "opening"

    async def step(self, attorney_action_tokens) -> StepResult:
        response, thinking = parse_response(attorney_action_tokens)

        # Track revealed evidence
        new_evidence = extract_revealed_evidence(response)
        self.evidence_revealed.update(new_evidence)

        # Judge rules on any objections
        if is_objection(response):
            ruling = await self.judge.rule(response, self.case)
            return self.handle_objection_result(ruling)

        # If examining witness, get witness response
        if self.phase == "examination":
            witness_response = await self.current_witness.respond(response)
            self.witness_testimony.append(witness_response)

        # Update jury belief
        persuasion = self.jury.update_belief(response, self.case)
        self.jury_belief += persuasion

        # Opponent may respond
        opponent_action = await self.opponent.respond(response)

        reward = self.compute_reward(response, thinking, persuasion)

        return StepResult(
            reward=reward,
            episode_done=self.phase == "verdict",
            next_observation=self.build_prompt(opponent_action)
        )
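
The JuryModel passed into the environment is never specified; a toy version might nudge a single belief score per statement and clamp it to [0, 1]. The scoring helpers here (cites_evidence, contradicts_prior_testimony) and the weights are purely illustrative.

class JuryModel:
    """Toy jury: one belief score (0 = guilty, 1 = not guilty)."""

    def __init__(self, belief: float = 0.5):
        self.belief = belief

    def update_belief(self, statement: str, case) -> float:
        # Keyword-ish heuristics stand in for a learned persuasion model
        delta = 0.0
        if cites_evidence(statement, case):                  # hypothetical helper
            delta += 0.05
        if contradicts_prior_testimony(statement, case):     # hypothetical helper
            delta -= 0.05

        self.belief = min(1.0, max(0.0, self.belief + delta))
        return delta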

Reward Function

def compute_courtroom_reward(
    self, response, thinking, persuasion, game_state
) -> float:
    reward = 0.0

    # 1. JURY PERSUASION (Main objective)
    # Positive if moving jury toward your side
    if self.role == "prosecution":
        reward += persuasion * -1  # Lower number = guilty
    else:
        reward += persuasion  # Higher number = not guilty

    # 2. EVIDENCE TIMING (+0.3 for dramatic reveals)
    if revealed_evidence_at_optimal_time(response, game_state):
        reward += 0.3

    # 3. SUCCESSFUL OBJECTION (+0.2)
    if objection_sustained(response, game_state):
        reward += 0.2

    # 4. OBJECTION OVERRULED (-0.1)
    if objection_overruled:
        reward -= 0.1

    # 5. WITNESS CREDIBILITY DAMAGE (+0.2, cross-exam)
    if damaged_witness_credibility(response, game_state):
        reward += 0.2

    # 6. CASE THEORY CONSISTENCY (+0.1)
    if maintains_narrative_consistency(response, thinking, game_state):
        reward += 0.1

    # VERDICT
    if game_state.phase == "verdict":
        if won_case:
            reward += 5.0
        else:
            reward += 2.0 * (jury_closeness_to_winning)

    return reward

Case Structure

@dataclass
class Case:
    crime: str
    defendant: Character
    victim: Character
    true_facts: list[Fact]  # What actually happened

    # Prosecution's case
    prosecution_evidence: list[Evidence]
    prosecution_witnesses: list[Witness]
    prosecution_theory: str

    # Defense's case
    defense_evidence: list[Evidence]
    defense_witnesses: list[Witness]
    defense_theory: str

    # Shared
    stipulated_facts: list[Fact]  # Both sides agree

@dataclass
class Evidence:
    id: str
    description: str
    reveals: list[str]  # What facts it proves
    weaknesses: list[str]  # How to attack it
    dramatic_value: float  # Impact when revealed
    prerequisites: list[str]  # Must establish these first
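
As an example, the 10:47 PM footage from the Morrison case might be encoded with this structure (all values illustrative):

peters_footage = Evidence(
    id="footage_1047",
    description="Security footage of Peters near the safe at 10:47 PM",
    reveals=["peters_at_scene", "someone_present_during_guard_break"],
    weaknesses=["grainy image", "timestamp could be challenged"],
    dramatic_value=0.9,
    prerequisites=["guard_break_established", "morrison_alibi_established"],
)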

Thinking Block Schema

<think>
=== CASE OVERVIEW ===
Case: People v. Morrison
Charge: Grand theft ($500,000)
My role: Defense

=== TRUE FACTS (as I know them) ===
The money was stolen at 10:52pm
My client was in the building but not near the safe
Peters (colleague) actually stole the money
Morrison is being framed because he was convenient

=== MY EVIDENCE (in order of planned reveal) ===
1. Character witnesses (Opening)
   - Morrison's boss: "Trusted employee for 15 years"
   - Purpose: Establish baseline credibility

2. Timeline challenge (Prosecution case)
   - Security guard's break (10:45-11:00)
   - Purpose: Create reasonable doubt about who was there

3. Alternative suspect (Defense case - THE BOMB)
   - Security footage of Peters at 10:47pm
   - Peters' gambling debts
   - Text messages
   - Janitor testimony
   - Purpose: Point finger at real culprit

4. Morrison's alibi (Defense case)
   - Was on phone with wife 10:40-11:05
   - Phone records confirm
   - Purpose: Physical impossibility

=== CURRENT PHASE ===
Phase: Cross-examination of prosecution witness #2
Witness: Sarah Chen (security guard)
Goal: Establish the gap, don't reveal why (yet)

=== STRATEGIC NOTES ===
- Prosecution doesn't know I have the 10:47 footage
- Save Peters reveal for my case-in-chief
- Build jury frustration: "Something doesn't add up"
- Let them want an answer before I provide one
</think>

Trial Phases

Opening Statement

Set up your narrative without revealing everything

<think>
Goal: Promise a story, don't deliver it yet
Reveal: Client's character, general timeline
Withhold: Peters theory, key evidence
Tease: "You'll hear testimony that changes everything"
</think>

Direct Examination

Build your case through friendly witnesses

<think>
Witness: My client's wife
Goal: Establish alibi for 10:40-11:05
Questions:
- "Did you speak with your husband that night?"
- "What time was that?"
- "How long did you speak?"
- "What did you discuss?"
Evidence to introduce: Phone records (after testimony)
</think>

Cross-Examination

Attack opponent's witnesses

<think>
Witness: Their expert
Goal: Undermine forensic evidence
Approach: Challenge chain of custody
Key question: "Were you present when the evidence was collected?"
Setup: Get them to admit they weren't there
Payoff: "So you're relying on what others told you?"
</think>

Closing Argument

Tie everything together

<think>
Structure:
1. Remind jury of reasonable doubt standard
2. Walk through timeline (emphasize 10:45-11:00 gap)
3. Present alternative: "What if it was Peters?"
4. Review evidence pointing to Peters
5. Appeal: "Don't convict an innocent man"

Dramatic peak: Parallel Morrison's 15 years of trust
              vs Peters' $50,000 in gambling debt
</think>

Training Data Sources

  1. Trial Transcripts

    • Public court records
    • Famous trials with strategic elements
  2. Legal Dramas (annotated)

    • Law & Order, Perry Mason, etc.
    • Annotate strategy vs drama
  3. Moot Court Competitions

    • Law school exercises
    • Expert commentary on tactics
  4. Self-Play

    • Prosecution vs defense agents
    • Vary cases and evidence sets

Training Pipeline

Phase 1: SFT on Trial Transcripts
├── Train on real and dramatized trials
├── Model learns legal procedure, examination techniques
└── ~5K trial transcripts

Phase 2: RL with Jury Simulation
├── Self-play prosecution vs defense
├── Reward: jury persuasion + evidence timing
├── GRPO: same case, different evidence strategies
└── ~50K trials

Phase 3: Human Evaluation
├── Human jurors rate persuasiveness
├── Legal experts rate procedural accuracy
├── DPO on preferred tactics
└── Balance drama vs realism

Evaluation Metrics

  • Win Rate: % of cases won
  • Jury Delta: Average jury swing toward your side
  • Evidence Timing Score: Optimal reveal timing
  • Objection Success Rate: % of objections sustained
  • Narrative Consistency: Case theory coherence
  • Dramatic Impact: Human rating of trial engagement

Why Thinking Models Excel Here

  1. Evidence Strategy: Full case file in <think>, revealed strategically
  2. Timeline Planning: Can map out entire trial arc
  3. Witness Prep: Can anticipate cross-examination angles
  4. Objection Reasoning: Can reason about legal grounds
  5. Jury Modeling: Can track what jury believes at each moment

Courtroom drama is about information revelation—both sides know more than they initially show. The <think> block lets each attorney maintain their full case while revealing it piece by piece for maximum impact.

The Art of the Dramatic Reveal

<think>
=== REVEAL PLANNING: PETERS' FOOTAGE ===

Current jury state: Confused about timeline, some doubt

Setup needed before reveal:
1. ✓ Established security guard left at 10:45
2. ✓ Established Morrison's alibi for 10:40-11:05
3. ✓ Introduced Peters as character (just a colleague)
4. → Now: Reveal Peters' gambling debt (motive)
5. → Then: The footage

Optimal moment: After prosecution rests, first defense witness
Why: They can't rebut, jury has full attention on our case

The reveal sequence:
- Show gambling debts
- "Someone needed that money desperately"
- Show text messages
- "And they asked my client to cover for them"
- Pause. Let jury think.
- "Now let me show you security footage prosecution didn't have"
- Play the 10:47pm footage

Post-reveal: Let it sit. Don't oversell.
</think>
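
The revealed_evidence_at_optimal_time term in the reward above is left abstract. One rough way to implement it is to check each newly revealed item against its prerequisites and the trial phase; the evidence_by_id lookup and established_facts set on game_state are assumptions, not fields the document defines.

def revealed_evidence_at_optimal_time(response, game_state) -> bool:
    """Every item revealed this turn should have its prerequisites already
    established, and high dramatic_value items shouldn't be spent early."""
    for ev_id in extract_revealed_evidence(response):       # same helper the env uses
        evidence = game_state.case.evidence_by_id[ev_id]    # assumed lookup table
        if not set(evidence.prerequisites) <= game_state.established_facts:
            return False   # revealed before the setup was in place
        if evidence.dramatic_value > 0.7 and game_state.phase == "opening":
            return False   # spent the bomb too early
    return True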

Heist Mastermind

Fine-tune a thinking model to coordinate heist planning games where the mastermind has complete knowledge of security systems, while players propose plans that must be adjudicated fairly.

The Core Idea

The model plays the heist mastermind who:

  • Knows the complete security layout (in <think>)
  • Players propose infiltration plans
  • Model adjudicates: does the plan work?
  • Provides hints when players are close
  • Makes it challenging but achievable
Player: "We'll go through the ventilation system to bypass the main lobby"

<think>
=== SECURITY LAYOUT ===
Building: Meridian National Bank
Target: Vault B (contains diamond collection)

Security systems:
- Main lobby: 2 guards, metal detectors, cameras
- Ventilation: Motion sensors every 20 feet, too narrow after floor 3
- Service entrance: Keycard required, 1 guard, camera blind spot 10-15 sec
- Roof: Helicopter pad access, alarmed door, but maintenance hatch unwatched
- Vault approach: Laser grid (2am-3am maintenance window), biometric lock

=== PLAYER'S PLAN: Ventilation ===
Assessment: PARTIAL SUCCESS
- Vent access: Possible (external grate removable)
- Motion sensors: Not mentioned - would trigger alarm
- Size constraint: Vents narrow at floor 3, can't reach vault level

=== ADJUDICATION ===
Tell them about motion sensors (fair hint)
Mention the narrowing (physical impossibility)
Don't reveal the maintenance hatch (they haven't found it yet)
Encourage them - this shows good thinking
</think>

*spreads building blueprints*

The ventilation system... not bad thinking. You'd bypass the lobby entirely.
But there are two problems.

*taps the blueprint*

First, the building has motion sensors in the vents every twenty feet.
High-end place like this, they thought of intruders going that route.

Second, I don't know if any of you have actually been in a vent before,
but they get real narrow after the third floor. You'd get stuck long
before you reached the vault level.

But you're thinking in the right direction. The lobby isn't the only
way up. What else does a building like this need? Think about
maintenance. Deliveries. What gets in and out besides people?

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Security consistency | Contradicts own layout | Maintains complex state |
| Adjudication fairness | Too easy or impossible | Calibrated challenge |
| Hint quality | Too obvious or cryptic | Goldilocks hints |
| Drama/tension | Flat | Builds suspense |
| Player engagement | Frustrating | Rewarding discovery |

Environment Design

class HeistEnv(Env):
    def __init__(
        self,
        target: HeistTarget,  # Building, security, layout
        crew: list[CrewMember],  # Players' skills and equipment
        mastermind_hints: int = 3  # Hints available
    ):
        self.target = target
        self.crew = crew
        self.hints_remaining = mastermind_hints
        self.discovered_info: set[str] = set()
        self.plan_attempts: list[Plan] = []
        self.current_plan: Plan | None = None
        self.phase = "planning"

    async def step(self, mastermind_response_tokens) -> StepResult:
        response, thinking = parse_response(mastermind_response_tokens)

        # Did mastermind reveal too much?
        over_revealed = revealed_critical_vulnerability(
            response, self.discovered_info, self.target
        )

        # Did mastermind unfairly block valid plan?
        if self.phase == "execution":
            unfair_block = blocked_valid_plan(
                response, thinking, self.current_plan, self.target
            )

        # Track what players now know
        new_info = extract_revealed_info(response)
        self.discovered_info.update(new_info)

        # Get player's next action/question
        player_action = await self.get_player_action()

        reward = self.compute_reward(
            response, thinking, over_revealed, new_info
        )

        return StepResult(
            reward=reward,
            episode_done=self.heist_complete(),
            next_observation=self.build_prompt(player_action)
        )

Reward Function

def compute_heist_reward(
    response, thinking, over_revealed, new_info, game_state
) -> float:
    reward = 0.0

    # 1. FAIR ADJUDICATION (+0.3)
    # Decision matches actual security layout
    if adjudication_matches_layout(response, thinking, game_state):
        reward += 0.3
    else:
        reward -= 0.5  # Inconsistent GMing

    # 2. APPROPRIATE DIFFICULTY (+0.2)
    # Not too easy, not impossible
    if difficulty_appropriate(game_state.attempts, game_state.progress):
        reward += 0.2

    # 3. HINT QUALITY (+0.2)
    # Hints guide without solving
    if gave_hint and hint_is_useful_not_solution(response, game_state):
        reward += 0.2

    # 4. NO OVER-REVELATION (-0.3)
    if over_revealed:
        reward -= 0.3

    # 5. PLAYER ENGAGEMENT (+0.1)
    # Response encourages continued thinking
    if encourages_exploration(response):
        reward += 0.1

    # 6. DRAMATIC TENSION (+0.1)
    # Built suspense appropriately
    if built_tension(response, game_state.phase):
        reward += 0.1

    # ENDGAME
    if game_state.heist_complete:
        if players_succeeded:
            # They won - was it earned?
            if required_good_plan:
                reward += 2.0  # Fair victory
            else:
                reward -= 0.5  # Too easy
        else:
            # They failed - was it fair?
            if had_path_to_success:
                reward += 1.0  # Fair challenge
            else:
                reward -= 1.0  # Impossible (bad design)

    return reward

Heist Target Structure

@dataclass
class HeistTarget:
    name: str
    location: str
    objective: str  # What players need to steal

    # Physical layout
    floors: list[Floor]
    entrances: list[Entrance]
    rooms: dict[str, Room]

    # Security systems
    guards: list[Guard]
    cameras: list[Camera]
    alarms: list[Alarm]
    locks: list[Lock]

    # Vulnerabilities (what players need to find)
    vulnerabilities: list[Vulnerability]

    # Schedule
    shift_changes: list[TimeWindow]
    maintenance_windows: list[TimeWindow]
    delivery_times: list[TimeWindow]

@dataclass
class Vulnerability:
    id: str
    description: str
    location: str
    time_window: TimeWindow | None
    required_skill: str | None
    hints: list[str]  # Progressive hints to reveal
    difficulty_to_discover: float
    critical_for_success: bool
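
For instance, the corroded-skylight weakness from the museum scenario below could be encoded as (values illustrative):

skylight_vuln = Vulnerability(
    id="skylight_corroded_wire",
    description="Vibration sensor wire corroded in the skylight's northwest corner",
    location="roof_skylight",
    time_window=None,                 # exploitable at any hour
    required_skill="climbing",        # illustrative
    hints=[
        "The museum is old. Old buildings have old problems.",
        "Have you looked at the roof? Really looked?",
        "The skylight's been there since 1952. Wiring ages.",
        "Northwest corner of the skylight. The wire's green.",
    ],
    difficulty_to_discover=0.8,
    critical_for_success=True,
)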

Thinking Block Schema

<think>
=== TARGET: Hartley Museum ===
Objective: Steal the Moonshadow Diamond from traveling exhibit

=== COMPLETE SECURITY LAYOUT ===

ENTRANCES:
1. Main entrance: Metal detectors, 2 guards, visitor log
2. Staff entrance (east): Keycard, 1 guard, camera
3. Loading dock (basement): Scheduled deliveries only, camera, no guard overnight
4. Roof access: Alarmed, requires key from security office
5. Skylight (above main gallery): Wired for vibration, but...
   ** VULNERABILITY: Wire corroded in northwest corner **

GUARD PATROL:
- 4 guards total, rotating
- Full building sweep every 45 minutes
- Guard station unmanned during sweeps (10-12 minutes)
- ** VULNERABILITY: Sweep pattern predictable **

DIAMOND SECURITY:
- Pressure-sensitive pedestal (removes weight = alarm)
- Laser grid around display case (disabled 3-5am for cleaning)
- Tempered glass case (requires diamond-tip cutter, 3 minutes)
- ** VULNERABILITY: Cleaning crew has access during laser-off window **

CAMERAS:
- 47 cameras total
- Security room monitors on rotating display (6 at a time)
- ** VULNERABILITY: Can't watch all feeds simultaneously **

=== PLAYER STATE ===
Discovered:
- Main entrance security (observed)
- Guard shift change times (researched)
- Delivery schedule (social engineering)

Not discovered:
- Skylight vulnerability (critical path)
- Laser grid timing (need to find cleaning schedule)
- Security room monitoring limitation

=== CURRENT PLAN BEING PROPOSED ===
"Disguise as cleaning crew"

Assessment:
- Good thinking (cleaning crew has access)
- Problem: Don't know the laser grid schedule yet
- Problem: Don't know how to get cleaning crew uniforms/IDs

Response:
- Acknowledge the idea has merit
- Point out they're missing information
- Hint toward finding cleaning schedule
</think>

Plan Adjudication System

class PlanAdjudicator:
    def evaluate_plan(self, plan: Plan, target: HeistTarget) -> Evaluation:
        """Determine if plan would succeed given security layout"""

        failures = []
        close_calls = []
        successes = []

        for step in plan.steps:
            # Check each step against relevant security
            relevant_security = self.get_security_for_location(
                step.location, target
            )

            for security in relevant_security:
                if step.defeats(security):
                    successes.append((step, security))
                elif step.partially_defeats(security):
                    close_calls.append((step, security))
                else:
                    failures.append((step, security))

        return Evaluation(
            would_succeed=len(failures) == 0,
            failures=failures,
            close_calls=close_calls,
            feedback_level=self.determine_feedback_level(plan)
        )

Hint Progression System

class HintSystem:
    """Progressive hints for each vulnerability"""

    def get_hint(self, vulnerability: Vulnerability, hint_level: int) -> str:
        hints = vulnerability.hints
        if hint_level >= len(hints):
            return hints[-1]  # Most direct hint
        return hints[hint_level]

    # Example hints for skylight vulnerability:
    # Level 0: "The museum is old. Old buildings have old problems."
    # Level 1: "Have you looked at the roof? Really looked?"
    # Level 2: "The skylight's been there since 1952. Wiring ages."
    # Level 3: "Northwest corner of the skylight. The wire's green."
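
One plausible way to drive hint_level is off the number of failed plans that needed the vulnerability in question, so players who keep missing it get progressively more direct nudges. The missing_discoveries field on a plan attempt is an assumption, not something the document defines.

def hint_level_for(vulnerability, plan_attempts: list) -> int:
    """Escalate roughly one hint level per two failed attempts that
    would have needed this vulnerability to succeed."""
    relevant_failures = [
        plan for plan in plan_attempts
        if vulnerability.id in plan.missing_discoveries   # assumed field
    ]
    return min(len(relevant_failures) // 2, len(vulnerability.hints) - 1)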

Training Data Sources

  1. Heist Movies (annotated)

    • Ocean's Eleven, Heat, The Italian Job
    • Extract security layouts and plan structures
  2. Tabletop RPG Sessions

    • Blades in the Dark, Shadowrun heist missions
    • GM notes + player plans + outcomes
  3. Escape Room Design

    • Puzzle progression and hint systems
    • Difficulty calibration expertise
  4. Synthetic Generation

    • Generate buildings with security systems
    • Ensure at least one valid path exists
    • Self-play to find exploits

Training Pipeline

Phase 1: SFT on Heist Narratives
├── Train on movies, games, RPG sessions
├── Model learns security description, hint style
└── ~5K heist scenarios

Phase 2: RL with Plan Verification
├── Players propose plans, model adjudicates
├── Reward: fair_adjudication + appropriate_difficulty
├── Automatic verification against layout
└── ~50K planning sessions

Phase 3: Human Playtesting
├── Real players attempt heists
├── Rate fairness, engagement, satisfaction
├── DPO on preferred GM styles
└── Calibrate hint timing

Evaluation Metrics

  • Adjudication Accuracy: % of rulings consistent with layout
  • Success Rate: % of heists that succeed (target: 40-60%)
  • Attempt Count: Average plans tried before success (target: 3-5)
  • Hint Efficiency: Hints used vs available
  • Player Satisfaction: Rating of fairness and fun
  • Discovery Rate: % of vulnerabilities found organically

Why Thinking Models Excel Here

  1. Complete Knowledge: Full security layout in <think>
  2. Fair Adjudication: Can verify plans against layout
  3. Hint Calibration: Can reason about what to reveal
  4. Consistency: Same vulnerability stays in same place
  5. Dramatic Pacing: Can build tension toward execution

The heist mastermind needs to hold a complex system in mind while revealing it piece by piece. The <think> block is perfect for maintaining the authoritative "truth" while the game unfolds.

The Golden Path

Every heist should have at least one "golden path":

<think>
=== GOLDEN PATH FOR MOONSHADOW DIAMOND ===

Required discoveries:
1. Laser grid disabled 3-5am (from cleaning schedule)
2. Skylight vulnerability (from building inspection or hint)
3. Guard sweep pattern (from observation)

Execution:
1. Obtain cleaning uniforms (social engineering or theft)
2. Enter during legitimate cleaning hours (2:30am)
3. Access roof during guard sweep (guards away 10 min)
4. Enter through corroded skylight corner
5. Wait until 3am (laser grid down)
6. Replace diamond with weight-matched replica
7. Exit via cleaning crew route (they leave at 4am)

Alternative paths:
- Loading dock entry (harder, requires forgery)
- Bribe security guard (risky, high social skill needed)
- Fire alarm distraction (crude, might work)

This ensures heist IS solvable while providing challenge.
</think>
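
During synthetic scenario generation, this property can be enforced mechanically: reject any generated target for which the adjudicator finds no successful plan. A sketch, reusing the PlanAdjudicator above; enumerate_candidate_plans and the generator object are hypothetical.

def has_golden_path(target, adjudicator: PlanAdjudicator) -> bool:
    """Keep only targets where at least one enumerated plan succeeds."""
    for plan in enumerate_candidate_plans(target):       # hypothetical plan search
        if adjudicator.evaluate_plan(plan, target).would_succeed:
            return True
    return False

def generate_solvable_target(generator) -> "HeistTarget":
    adjudicator = PlanAdjudicator()
    while True:
        target = generator.sample()                       # hypothetical generator
        if has_golden_path(target, adjudicator):
            return target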

Time Loop Gamemaster

Fine-tune a thinking model to run Groundhog Day-style time loop games where the GM maintains consistency across resets while revealing new information each cycle.

The Core Idea

The model GMs a repeating day where:

  • Events happen the same way unless player intervenes
  • Player learns more each loop, changes strategy
  • GM must maintain perfect consistency across loops
  • The <think> block holds the complete timeline of events
Loop 3: Player tries to prevent the explosion

<think>
=== THE TRUE TIMELINE ===
6:00am - Day starts, player wakes in hotel
8:00am - Dr. Chen leaves lab (briefcase with files)
9:15am - Delivery truck arrives at research center
10:30am - Janitor Marcus takes break, leaves back door propped
11:45am - Mysterious figure enters via back door
12:00pm - Fire alarm pulled (distraction)
12:03pm - Figure plants bomb in Lab C
12:30pm - Figure exits, alarm reset
2:00pm - Dr. Chen returns to lab
3:47pm - EXPLOSION (Lab C, bomb detonates)
4:00pm - Day resets if player dies or reaches midnight

=== WHAT PLAYER HAS LEARNED ===
Loop 1: Discovered explosion, died at 3:47pm
Loop 2: Learned Dr. Chen leaves at 8am, found janitor's back door
Loop 3: Trying to watch the back door

=== CURRENT TIME: 10:25am ===
Player is hiding near back door
Marcus hasn't taken his break yet (10:30)

=== CONSISTENCY RULES ===
- Marcus ALWAYS takes break at 10:30, ALWAYS props door
- Figure ALWAYS enters at 11:45 IF door propped
- If player blocks door, figure uses alternate entry (loading dock)
- Bomb ALWAYS in Lab C unless player intervenes post-plant

=== THIS INTERACTION ===
Player wants to wait and watch. Nothing happens for 5 minutes.
At 10:30, show Marcus taking break, propping door.
Player can now choose to intervene or keep watching.
</think>

You crouch behind the recycling bins, watching the back entrance.
The morning sun crawls across the parking lot. Minutes tick by.
Nothing unusual—just employees going about their routines.

Then, at exactly 10:30, you see him. The janitor—Marcus, according
to his name tag—pushes through the back door with a mop bucket.
He glances around, then props the door open with a brick.

"Just gonna be a minute," he mutters to no one, pulling out a
cigarette. He walks toward the smoking area, leaving the door
propped wide open.

The clock reads 10:32. You have about an hour before anything
else is supposed to happen here—if your previous loops are
any guide.

What do you do?

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Timeline consistency | Events drift between loops | Perfect consistency |
| Player knowledge tracking | Forgets what they learned | Accurate memory |
| Butterfly effects | Random or ignored | Logical consequences |
| Hint calibration | Too obvious or cryptic | Progressive revelation |
| Mystery pacing | Gives away or withholds | Proper discovery curve |

Environment Design

class TimeLoopEnv(Env):
    def __init__(
        self,
        true_timeline: Timeline,  # What happens without intervention
        mystery: Mystery,  # What player is trying to solve/prevent
        loop_limit: int = 10
    ):
        self.timeline = true_timeline
        self.mystery = mystery
        self.loop_limit = loop_limit
        self.current_loop = 1
        self.current_time = Time("6:00am")
        self.player_knowledge: set[str] = set()
        self.interventions_this_loop: list[Intervention] = []

    async def step(self, gm_response_tokens) -> StepResult:
        response, thinking = parse_response(gm_response_tokens)

        # Check consistency with true timeline
        consistency_error = check_timeline_consistency(
            response, thinking, self.timeline, self.interventions_this_loop
        )

        # Track what player learned this response
        new_knowledge = extract_player_learnings(response, thinking)
        self.player_knowledge.update(new_knowledge)

        # Get player's next action
        player_action = await self.get_player_action()

        # Check if action creates intervention
        if is_intervention(player_action, self.timeline):
            intervention = create_intervention(player_action, self.current_time)
            self.interventions_this_loop.append(intervention)

        # Advance time
        self.advance_time(player_action)

        # Check for loop reset conditions
        if self.should_reset():
            self.reset_loop()

        reward = self.compute_reward(
            response, thinking, consistency_error, new_knowledge
        )

        return StepResult(
            reward=reward,
            episode_done=self.mystery_solved() or self.loop_limit_reached(),
            next_observation=self.build_prompt(player_action)
        )

Reward Function

def compute_timeloop_reward(
    response, thinking, consistency_error, new_knowledge, game_state
) -> float:
    reward = 0.0

    # 1. TIMELINE CONSISTENCY (+0.3)
    if not consistency_error:
        reward += 0.3
    else:
        reward -= 0.5  # Breaking the timeline is bad

    # 2. NEW INFORMATION REVEALED (+0.2)
    # Player learned something new this loop
    if new_knowledge - game_state.previous_knowledge:
        reward += 0.2

    # 3. BUTTERFLY EFFECT HANDLING (+0.2)
    # Properly cascaded consequences of player intervention
    if interventions_created_logical_consequences(game_state):
        reward += 0.2

    # 4. MYSTERY PACING (+0.1)
    # Appropriate amount revealed for current loop
    if revelation_appropriate_for_loop(new_knowledge, game_state.current_loop):
        reward += 0.1

    # 5. DRAMATIC MOMENTS (+0.1)
    # Built tension appropriately
    if dramatic_timing_good(response, game_state):
        reward += 0.1

    # ENDGAME
    if game_state.mystery_solved:
        # Player saved the day
        if solved_in_reasonable_loops(game_state.current_loop):
            reward += 3.0
        else:
            reward += 1.0  # Eventually solved

    return reward

Timeline Structure

@dataclass
class Timeline:
    events: list[Event]  # In chronological order
    key_actors: dict[str, Actor]  # People whose schedules matter
    locations: dict[str, Location]
    cause_chains: list[CauseChain]  # Event A causes Event B

@dataclass
class Event:
    time: Time
    location: str
    description: str
    actors: list[str]
    observable: bool  # Can player see this happen?
    interruptible: bool  # Can player prevent this?
    interrupt_consequences: dict[str, Event]  # What happens if interrupted

@dataclass
class CauseChain:
    cause: str  # Event ID
    effect: str  # Event ID
    delay: Time  # How long after cause does effect happen
    required_conditions: list[str]  # What else must be true
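
As an example, the propped-door chain from the bombing timeline could be expressed with these structures; the Time constructor usage for a duration is illustrative.

door_propped_chain = CauseChain(
    cause="janitor_props_back_door",            # the 10:30am event
    effect="figure_enters_back_door",           # the 11:45am event
    delay=Time("1h15m"),                        # assumed duration format
    required_conditions=["back_door_still_propped_at_11:45am"],
)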

Thinking Block Schema

<think>
=== MASTER TIMELINE ===
Day: Tuesday, October 17th
Setting: Millbrook Research Center

6:00am - Player wakes in hotel (loop start)
7:30am - Sarah Collins argues with husband at coffee shop
8:00am - Dr. Chen leaves lab with briefcase
8:15am - Security guard rotation (Mike → Janet)
9:00am - Town council meeting begins (Mayor relevant)
9:15am - Delivery truck arrives (cover for accomplice)
10:00am - Dr. Chen's assistant notices missing files
10:30am - Janitor Marcus takes smoke break, props door
11:00am - Sarah Collins arrives at research center (upset)
11:30am - Argument between Sarah and Dr. Chen (about funding)
11:45am - Hooded figure enters via propped back door
12:00pm - Fire alarm pulled by figure
12:03pm - Bomb planted in Lab C
12:30pm - Figure exits, alarm reset
1:00pm - Police investigate false alarm (cursory)
2:00pm - Dr. Chen returns to lab
3:00pm - Sarah leaves town (guilt? or coincidence?)
3:47pm - EXPLOSION (Lab C)

=== MYSTERY SOLUTION ===
Bomber: Sarah Collins
Motive: Revenge (affair between Dr. Chen and her husband)
Key evidence: Security footage (deleted but recoverable)
             Sarah's timeline (no alibi 11:45-12:30)
             Husband's confession (if pressured)

=== PLAYER STATE (Loop 4) ===
Knows:
- Explosion happens at 3:47pm in Lab C
- Janitor props door at 10:30
- Figure enters at 11:45
- Dr. Chen has important briefcase
- Sarah and Chen argued (heard about it)

Doesn't know:
- Figure's identity (Sarah)
- Sarah's motive
- Where bomb materials came from
- That fire alarm was distraction

Currently investigating: The briefcase (wrong lead but okay)

=== CURRENT TIME: 8:05am ===
Player following Dr. Chen

=== CONSISTENCY CHECK ===
This loop, player has NOT:
- Interacted with Marcus
- Been to research center yet
- Changed any events

Timeline proceeds normally until player acts.
</think>

Loop Reset Mechanics

class LoopManager:
    def reset_loop(self, game_state):
        """Reset to day start, preserve player knowledge"""

        # Keep what the player learned: knowledge and strategies_tried
        # carry over untouched; only the loop counter advances
        game_state.current_loop += 1

        # Reset timeline to baseline
        game_state.current_time = Time("6:00am")
        game_state.interventions_this_loop = []
        game_state.npcs_alerted = set()

        # NPC "memory" resets, but some might notice player
        # (depending on how weird player's been acting)

        return game_state

Butterfly Effect System

class ButterflyEffects:
    """Track consequences of player interventions"""

    def compute_consequences(
        self,
        intervention: Intervention,
        timeline: Timeline
    ) -> list[Event]:
        """Given an intervention, what changes?"""

        affected_events = []

        # Find events that depended on the interrupted event
        for event in timeline.events:
            if event.time > intervention.time:
                if self.event_affected_by(event, intervention):
                    modified = self.modify_event(event, intervention)
                    affected_events.append(modified)

        return affected_events

    # Example: Player removes the door prop
    # → Figure can't use back door
    # → Figure uses loading dock instead (30 min later)
    # → Bomb planted at 12:33 instead of 12:03
    # → Explosion at 4:17 instead of 3:47

Training Data Sources

  1. Time Loop Media

    • Groundhog Day, Happy Death Day, Russian Doll
    • Outer Wilds, Minit, 12 Minutes (games)
    • Annotate timeline consistency
  2. Mystery Structures

    • Murder mysteries with complete timelines
    • Layer in the loop mechanic
  3. Tabletop Games

    • Time loop RPG sessions (Memories of Glass, etc.)
    • GM notes showing planned vs actual events
  4. Synthetic Generation

    • Generate timelines with key events
    • Ensure mystery is solvable with discovered info

Training Pipeline

Phase 1: SFT on Time Loop Stories
├── Train on annotated time loop narratives
├── Model learns consistency, revelation pacing
└── ~5K scenarios

Phase 2: RL with Consistency Checking
├── Self-play through loops
├── Reward: consistency + revelation + solving_progress
├── Automatic verification of timeline integrity
└── ~50K loop sessions

Phase 3: Human Playtesting
├── Players attempt to solve mysteries
├── Rate fairness, consistency, satisfaction
├── DPO on preferred GM responses
└── Calibrate loop count to solution

Evaluation Metrics

  • Consistency Rate: % of responses matching established timeline
  • Solve Rate: % of players who solve the mystery
  • Loop Efficiency: Average loops to solution (target: 4-6)
  • Information Pacing: Clues per loop (target: 2-3 new facts)
  • Butterfly Logic: % of consequences that make causal sense
  • Player Satisfaction: Rating of fairness and engagement

Why Thinking Models Excel Here

  1. Perfect Timeline: Complete day stored in <think>
  2. Consistency Checking: Can verify responses match history
  3. Butterfly Computation: Can reason about cascading changes
  4. Knowledge Tracking: Knows what player knows vs doesn't
  5. Revelation Planning: Can pace mystery across loops

Time loops require unprecedented consistency—the same event must happen the same way every loop unless changed. The <think> block as authoritative timeline makes this tractable.

The Puzzle of Progressive Revelation

<think>
=== REVELATION PLAN ===

Loop 1 (Discovery):
- Player learns WHAT happens (explosion)
- Deaths, destruction, the stakes
- Exit: Player dies in explosion

Loop 2 (Investigation Start):
- Player learns WHEN (3:47pm)
- Learns WHERE (Lab C)
- Can explore timeline freely
- Exit: Player dies trying to defuse bomb

Loop 3 (Suspect Hunting):
- Player should find the back door entry
- Janitor's routine discoverable
- Exit: Player caught watching, killed by bomber

Loop 4 (Closing In):
- Player should see the figure
- Fire alarm connection
- Can track figure's movements
- Exit: Player confronts but no evidence

Loop 5+ (Resolution):
- Player has enough to solve
- Multiple paths to victory:
  a) Stop the bomber before planting
  b) Evacuate the building
  c) Get confession before explosion
  d) Remove bomb after planting

This pacing ensures players feel progress each loop
without solving too quickly.
</think>

Survival Horror Director

Fine-tune a thinking model to direct survival horror games—managing tension, resource scarcity, and scare timing. The model tracks monster positions, player psychology, and optimal fear pacing in <think>.

The Core Idea

The model is the game director who:

  • Knows monster locations and behavior patterns
  • Tracks player resources (ammo, health, sanity)
  • Times scares for maximum impact
  • Balances terror with fairness
  • Never lets player feel safe OR hopeless
Player: "I slowly push open the door to the west wing"

<think>
=== MONSTER STATE ===
The Shade:
  - Current location: West Wing Hallway (DANGER!)
  - Behavior: Patrolling, will investigate sounds
  - Detection range: 15 feet
  - Last fed: 2 hours ago (HUNGRY - more aggressive)

Other threats:
  - Corpse in Room 203 (will reanimate if disturbed)
  - Trap in stairwell (player hasn't found yet)

=== PLAYER STATE ===
Resources:
  - Pistol: 4 rounds remaining
  - Flashlight: 60% battery
  - Medical kit: 1
  - Sanity: 65% (starting to see things)

Stress indicators:
  - Last scare: 8 minutes ago (due for another)
  - Last combat: 20 minutes ago (overdue)
  - Health: 80%
  - Deaths this session: 1

=== TENSION CURVE ===
Current phase: Rising action (building dread)
Recent events: Found the diary (plot), heard scratching (ambient)
Optimal next beat: Close encounter (see monster, not combat yet)

=== THIS MOMENT ===
Player is about to enter The Shade's patrol zone.
Perfect timing for a tension spike.

Options:
A) Shade visible at end of hallway (terror, gives player choice)
B) Shade behind door (jumpscare, less fair)
C) Evidence of Shade (claw marks, fresh) (dread building)

Best choice: A - player sees it, has options, maximum tension
But make escape possible - not unfair death
</think>

The door groans as it swings inward, hinges screaming in protest.
The sound echoes through the west wing like a dinner bell.

Your flashlight beam cuts through the darkness, revealing a long
hallway. Peeling wallpaper. Doors hanging off hinges. And at the
far end, maybe fifty feet away—

It's there.

The thing you've been hearing. Tall. Wrong-shaped. Its head turns
toward you, drawn by the sound of the door. You can't see its eyes,
but you feel them.

It hasn't moved yet. It's... watching. Deciding.

You have perhaps three seconds before it decides.

The doors on either side of you are closed. Behind you is the main
hall. Your pistol feels impossibly small in your hand.

What do you do?

Why Fine-Tune vs Prompt

| Aspect | Prompting | Fine-Tuned |
| --- | --- | --- |
| Scare timing | Random or too frequent | Optimized pacing |
| Resource balance | Too generous or too harsh | Calibrated scarcity |
| Monster behavior | Inconsistent | Predictable enough to learn |
| Tension curve | Flat | Proper rise and fall |
| Player psychology | Ignored | Tracked and exploited |

Environment Design

class SurvivalHorrorEnv(Env):
    def __init__(
        self,
        location: HorrorLocation,  # Map, rooms, secrets
        monsters: list[Monster],
        player_resources: Resources,
        tension_model: TensionModel
    ):
        self.location = location
        self.monsters = monsters
        self.resources = player_resources
        self.tension = tension_model
        self.scare_history: list[Scare] = []
        self.player_sanity = 100

    async def step(self, director_response_tokens) -> StepResult:
        response, thinking = parse_response(director_response_tokens)

        # Track scare delivered
        scare = extract_scare_event(response)
        if scare:
            self.scare_history.append(scare)

        # Update tension based on events
        self.tension.update(response, self.scare_history)

        # Check fairness of any combat or death this turn
        # (feeds the death-fairness terms in the reward)
        if involves_combat_or_death(response):
            fairness = evaluate_fairness(
                response, self.resources, thinking
            )

        # Update player state
        self.update_player_state(response)

        # Get player action
        player_action = await self.get_player_action()

        # Update monster positions based on player action
        self.update_monster_positions(player_action)

        reward = self.compute_reward(response, thinking, scare)

        return StepResult(
            reward=reward,
            episode_done=self.player_dead() or self.player_escaped(),
            next_observation=self.build_prompt(player_action)
        )

Reward Function

def compute_horror_reward(
    response, thinking, scare, game_state
) -> float:
    reward = 0.0

    # 1. TENSION OPTIMIZATION (+0.3)
    # Tension curve follows good pacing
    if tension_appropriate_for_phase(game_state):
        reward += 0.3

    # 2. SCARE TIMING (+0.2)
    # Not too frequent, not too sparse
    time_since_last = game_state.time_since_last_scare
    if is_optimal_scare_timing(time_since_last, scare):
        reward += 0.2
    elif scare and time_since_last < 2:  # minutes
        reward -= 0.2  # Too frequent

    # 3. RESOURCE FAIRNESS (+0.2)
    # Player always has SOME chance
    if player_has_options(game_state):
        reward += 0.2
    else:
        reward -= 0.3  # Unfair situation

    # 4. MONSTER CONSISTENCY (+0.2)
    # Monster behaves according to established rules
    if monster_behavior_consistent(response, thinking, game_state):
        reward += 0.2

    # 5. ATMOSPHERE QUALITY (+0.1)
    # Prose creates dread
    reward += atmosphere_score(response) * 0.1

    # 6. DEATH FAIRNESS
    if player_death:
        if death_was_fair(game_state):
            reward += 0.5  # Earned death
        else:
            reward -= 1.0  # Cheap death

    # SURVIVAL/ESCAPE
    if player_escaped:
        if was_challenging:
            reward += 2.0
        else:
            reward += 0.5  # Too easy

    return reward

Tension Curve Model

class TensionModel:
    """Manages psychological pacing of horror"""

    def __init__(self):
        self.current_tension = 0.0  # 0-1 scale
        self.phase = "exploration"  # exploration, rising, peak, release
        self.time_in_phase = 0.0    # minutes since the phase last changed

    def update(self, event: Event):
        if event.type == "scare":
            self.current_tension = min(1.0, self.current_tension + event.intensity)
        elif event.type == "safe_moment":
            self.current_tension = max(0.0, self.current_tension - 0.1)
        elif event.type == "discovery":
            self.current_tension += 0.05  # Builds slowly

        self.update_phase()

    def recommend_next_beat(self) -> str:
        """What should happen next for good pacing?"""

        if self.phase == "exploration" and self.time_in_phase > 5:
            return "ambient_scare"  # Build tension

        if self.phase == "rising" and self.current_tension > 0.7:
            return "peak_encounter"  # Pay off the tension

        if self.phase == "peak":
            return "release"  # Let player breathe

        return "continue"  # Keep current trajectory

Thinking Block Schema

<think>
=== LOCATION: Ashford Asylum ===
Area: East Wing, Floor 2
Player position: Room 217 (old patient room)

=== MONSTER POSITIONS ===
The Warden (main threat):
  Location: Nurses' station (50 feet away)
  State: Dormant (hasn't detected player)
  Trigger: Loud noise, flashlight beam, entering station
  Pattern: Patrols rooms 210-220 every 15 minutes
  Last patrol: 8 minutes ago (7 minutes until next)

Crawlers (lesser threats):
  Location: Vents throughout east wing
  State: Active at night
  Current time: 11:47 PM (ACTIVE)
  Nearest vent: Room 215 (player passed it)

=== PLAYER RESOURCES ===
Weapons:
  - Fire axe (10 hits remaining)
  - Flare (1) - scares Warden temporarily

Survival:
  - Flashlight: 35% battery (critical soon)
  - Bandages: 2
  - Painkillers: 1

Health: 60% (took damage from Crawler earlier)
Sanity: 45% (hallucinations becoming frequent)

=== SESSION STATISTICS ===
Time played: 45 minutes
Deaths: 0
Close calls: 3
Last scare: 4 minutes ago (Crawler in vent)
Last safe moment: 12 minutes ago (found the diary)

=== TENSION ANALYSIS ===
Current level: 0.6 (rising)
Phase: Rising action
Player emotion: Dread (low resources, heard Warden earlier)

Upcoming beats:
- In 7 minutes: Warden patrol passes nearby (scripted tension)
- Flashlight will die in ~10 minutes (resource pressure)
- Room 219 has ammo (reward for exploring)

=== THIS INTERACTION ===
Player is searching room 217.
Finding: Old medical file (lore, no scare)
Recommended beat: Distant sound (Warden stirring)
Purpose: Remind player of threat without encounter
</think>

Monster Design

@dataclass
class Monster:
    name: str
    behavior_pattern: str  # "patrol", "ambush", "stalker"
    detection_methods: list[str]  # "sound", "sight", "scent"
    aggression: float  # 0.0 (passive) to 1.0 (actively hunting)
    speed: str  # "slow", "medium", "fast"
    kill_method: str  # How it kills (for death scenes)
    weakness: str  # How player can escape/defeat
    fear_factor: float  # How scary its presence is

    # State
    current_location: str
    current_state: str  # "dormant", "searching", "hunting", "attacking"
    last_fed: Time | None  # Affects aggression
    player_detected: bool

@dataclass
class MonsterBehavior:
    """Rules for monster behavior"""

    def update(self, monster: Monster, player_action: str) -> Monster:
        # Check if player action triggers detection
        if self.would_detect(monster, player_action):
            monster.player_detected = True
            monster.current_state = "searching"

        # Update patrol/movement
        if monster.behavior_pattern == "patrol":
            monster.current_location = self.next_patrol_point(monster)

        return monster

Scare Types

Jumpscare

Sudden, intense, use sparingly

Without warning, a face appears in the window—
pressed against the glass, mouth open in a silent scream.
It's gone before you can react, leaving only a smear of fog.

Dread

Building horror through implication

The cradle in the corner is rocking. Slowly. Gently.
There's no draft. The windows are sealed.
It just... keeps... rocking.

Pursuit

Active chase sequence

The Warden has found you. Its footsteps pound behind you—
too fast, too close. You hear its breathing, wet and hungry.
The exit is twenty feet away. You won't make it at this pace.

Discovery

Horror of learning

The photo in the file shows a patient. Patient 217.
You recognize the face.
It's yours.
The admission date is fifteen years ago.
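
These categories could be encoded for the director's timing logic, for example with per-type intensity and cooldown values that the is_optimal_scare_timing check in the reward consults. The numbers here are illustrative guesses, and the scare.type field is assumed.

from dataclasses import dataclass

@dataclass
class ScareType:
    name: str
    intensity: float      # how much tension the beat adds
    min_cooldown: float   # minutes before another scare feels fair

SCARE_TYPES = {
    "jumpscare": ScareType("jumpscare", intensity=0.8, min_cooldown=10.0),
    "dread":     ScareType("dread",     intensity=0.3, min_cooldown=3.0),
    "pursuit":   ScareType("pursuit",   intensity=0.9, min_cooldown=15.0),
    "discovery": ScareType("discovery", intensity=0.5, min_cooldown=5.0),
}

def is_optimal_scare_timing(minutes_since_last: float, scare) -> bool:
    if scare is None:
        return True   # quiet beats are always allowed
    return minutes_since_last >= SCARE_TYPES[scare.type].min_cooldown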

Training Data Sources

  1. Horror Games

    • Resident Evil, Silent Hill, Amnesia
    • Annotate scare timing, resource placement
  2. Horror Fiction

    • Creepypasta, horror anthology stories
    • Extract pacing and tension techniques
  3. Let's Plays

    • Watch player reactions to scares
    • Learn optimal timing from successful moments
  4. Game Director AI Research

    • Left 4 Dead Director system
    • Dynamic difficulty adjustment literature

Training Pipeline

Phase 1: SFT on Horror Narratives
├── Train on games, fiction, annotated scares
├── Model learns atmosphere, monster description
└── ~5K horror sequences

Phase 2: RL with Tension Optimization
├── Self-play with fear/tension metrics
├── Reward: tension_curve + fairness + scare_timing
├── Calibrate to player survival rates
└── ~50K horror sessions

Phase 3: Human Playtesting
├── Players go through horror experiences
├── Rate scariness, fairness, atmosphere
├── DPO on preferred scare styles
└── Measure actual fear responses (GSR, etc.)

Evaluation Metrics

  • Tension Variance: Does tension rise and fall appropriately?
  • Scare Effectiveness: Player ratings of scare moments
  • Fairness Rating: Deaths that feel "earned" vs "cheap"
  • Resource Balance: Players feel scarce but not hopeless
  • Session Length: Players engaged without quitting from frustration
  • Survival Rate: Target 30-50% (challenging but beatable)

Why Thinking Models Excel Here

  1. Monster Tracking: Perfect knowledge of threat positions
  2. Tension Modeling: Can reason about psychological pacing
  3. Fairness Checking: Can verify player has options
  4. Scare Planning: Can time reveals for maximum impact
  5. Resource Awareness: Knows when to give and when to withhold

Survival horror is about controlled fear—the director must know everything while revealing carefully. The <think> block holds the complete horror map while the response delivers just enough to terrify.

The Art of the Save Room

Even horror needs release:

<think>
=== SAFE ROOM DESIGN ===
Location: Old chapel (Room 150)
Purpose: Respite before final act

Safety guarantees:
- No monsters will enter
- Save point available
- Resources placed: ammo, health
- Plot item here (the key)

Tension reset:
- Player should feel relief
- Music softer
- Lighting warmer
- But hints of what's coming (bloodstains on altar)

Duration: Let player stay as long as they want
Exit: When they leave, Act 3 begins
</think>

You push through the chapel doors and immediately feel it—
the absence of wrongness. For the first time in hours,
the air doesn't feel like it's watching you.

Candles still burn on the altar, impossibly. Their light is warm.
Safe. On the pew, someone has left supplies—ammunition, bandages.
And there, beneath the crucifix, a tarnished key.

The monsters haven't been here. They won't come here.
You don't know how you know this, but you do.

You can rest. For now.

*Save point available*