Dungeon Master with Hidden State
Fine-tune a thinking model to run tabletop RPG sessions where the DM maintains secret world state, NPC motivations, and plot hooks in the <think> block while delivering immersive narration to players.
The <think> block becomes the DM's private notes:
- Monster positions and stats
- NPC secrets and motivations
- Trap locations and triggers
- Plot threads and foreshadowing plans
- What's behind doors the players haven't opened
The visible response is pure in-character narration.
Player: "I search the merchant's cart while he's distracted"
<think>
MERCHANT SECRET: He's smuggling cursed artifacts for the cult
CART CONTENTS: Silks (visible), hidden compartment with obsidian dagger
PLAYER PERCEPTION: 14 (they'd need 15 to notice the compartment)
DECISION: They find the silks, miss the dagger, but I'll describe
something "off" about the cart's weight to foreshadow
</think>
You rifle through bolts of Calimshan silk, their patterns shimmering
in the afternoon light. Fine goods, certainly, but nothing unusual.
Still... the cart sits lower on its axles than silk alone would explain.
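
The split between private notes and visible narration in the example above can be sketched in a few lines. This is a minimal illustration, not the project's `parse_response`: the helper name `split_dm_output` is hypothetical, and it assumes the raw output is a string with at most one `<think>` block.

```python
import re

# Assumption: at most one <think>...</think> block per DM turn.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_dm_output(raw: str) -> tuple[str, str]:
    """Return (private_notes, visible_narration) for one raw DM turn."""
    match = THINK_RE.search(raw)
    if match is None:
        # No private notes emitted: everything is visible to the players.
        return "", raw.strip()
    notes = match.group(1).strip()
    # Everything outside the <think> block is what the players see.
    narration = (raw[: match.start()] + raw[match.end():]).strip()
    return notes, narration


raw_turn = (
    "<think>\n"
    "CART CONTENTS: Silks (visible), hidden compartment with obsidian dagger\n"
    "</think>\n"
    "You rifle through bolts of Calimshan silk."
)
notes, narration = split_dm_output(raw_turn)
print(narration)
```

Only `narration` would be shown to players; `notes` stays on the DM side and can be fed to state tracking or reward checks.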
| Aspect | Prompting | Fine-Tuned |
|---|---|---|
| State tracking | Drifts over long sessions | Learned format, consistent |
| Secret keeping | Occasionally leaks | Trained not to leak |
| Ruling style | Generic | Matches your system (5e, PbtA, OSR) |
| Pacing | Verbose or rushed | Learns dramatic timing |
| Cost per session (rough estimate) | $5-20 | $0.10-0.50 |
class DungeonMasterEnv(Env):
    def __init__(
        self,
        scenario: Scenario,        # Dungeon map, NPCs, secrets
        rule_system: str,          # "5e", "pathfinder", "fate"
        player_simulator: Policy,  # Or human-in-the-loop
    ):
        self.rule_system = rule_system
        self.player_simulator = player_simulator
        self.world_state = scenario.initial_state
        self.revealed_info: set[str] = set()
        self.session_history: list[Message] = []
        # The player action the DM is currently answering (None before the first turn)
        self.last_player_action = None

    async def step(self, dm_response_tokens) -> StepResult:
        response, thinking = parse_response(dm_response_tokens)
        # Update world state based on DM's declared changes
        self.world_state = update_state(self.world_state, thinking)
        # Get player action (simulated or real)
        player_action = await self.player_simulator.act(response)
        # Score the response against the action it was answering
        reward = compute_dm_reward(
            response, thinking, self.world_state, self.last_player_action
        )
        self.last_player_action = player_action
        return StepResult(
            reward=reward,
            episode_done=self.session_ended(),
            next_observation=self.build_next_prompt(player_action),
        )


def compute_dm_reward(response, thinking, world_state, last_player_action) -> float:
    reward = 0.0
    # 1. NO SECRET LEAKAGE (-5.0 per leaked secret)
    # Check if hidden info appears in visible response
    for secret in world_state.unrevealed_secrets:
        if secret.key_phrase in response:
            reward -= 5.0
    # 2. RULE CONSISTENCY (+0.5)
    # Did the DM apply rules correctly?
    if rules_applied_correctly(response, thinking, world_state):
        reward += 0.5
    # 3. NARRATIVE QUALITY (+0.1 to +1.0)
    # Atmospheric, appropriate length, advances story
    reward += narrative_score(response)  # Could be a learned reward model
    # 4. PLAYER AGENCY RESPECTED (+0.3)
    # DM didn't railroad or negate player choices
    if last_player_action is not None and player_choice_honored(response, last_player_action):
        reward += 0.3
    # 5. FORESHADOWING BONUS (+0.2)
    # Thinking shows planning, response includes subtle hints
    if has_foreshadowing(thinking) and has_subtle_hint(response):
        reward += 0.2
    # 6. SESSION PACING
    # Reward varies by session phase; is_combat / is_roleplay are helpers
    # that read the phase from world_state (left undefined, like the others)
    if is_combat(world_state):
        reward += combat_pacing_score(response)  # Snappy, tactical
    elif is_roleplay(world_state):
        reward += roleplay_pacing_score(response)  # Rich, character-driven
    return reward

- Actual Play Transcripts
  - Critical Role, Dimension 20 (with permission/licensing)
  - r/DnD session writeups
  - Annotate with "what the DM knew but didn't say"
- Published Adventures
  - Run through with simulated players
  - Module text provides ground truth for secrets
- Procedural Generation
  - Generate dungeons with hidden elements
  - Run self-play sessions
  - Filter for quality
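
Whichever source the data comes from, each annotated exchange eventually has to become a training example with the DM's unspoken knowledge inside the `<think>` block. The sketch below shows one possible shape for that step. The `AnnotatedTurn` record, the `to_sft_example` helper, and the prompt/completion field names are all assumptions made for illustration, not a format this document prescribes.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedTurn:
    """Hypothetical record for one annotated player/DM exchange."""
    player_action: str
    dm_notes: str       # what the DM knew but didn't say
    dm_narration: str   # what the DM actually said at the table


def to_sft_example(turn: AnnotatedTurn) -> dict[str, str]:
    """Format one turn as a prompt/completion pair with notes in <think>."""
    completion = (
        f"<think>\n{turn.dm_notes.strip()}\n</think>\n"
        f"{turn.dm_narration.strip()}"
    )
    return {
        "prompt": f"Player: {turn.player_action.strip()}",
        "completion": completion,
    }


turn = AnnotatedTurn(
    player_action="I search the merchant's cart while he's distracted",
    dm_notes="CART CONTENTS: Silks (visible), hidden compartment with obsidian dagger",
    dm_narration="You rifle through bolts of Calimshan silk.",
)
example = to_sft_example(turn)
print(example["completion"])
```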
Train the model to use structured thinking:
<think>
WORLD_STATE:
- Party location: Room 3 (trapped corridor)
- Active threats: Pressure plate (DC 14), Goblin patrol (2 rounds away)
- Party resources: Fighter 15/20 HP, Wizard 1/3 spell slots
NPC_STATE:
- Captured merchant: Will betray party if freed (cult member)
- Goblin chief: Open to negotiation if shown strength
PLAYER_INTENT: Rogue wants to scout ahead
RULING: Stealth check DC 12, success = spots pressure plate
DRAMATIC_CONSIDERATION: Party is low on resources, good time for
a tough choice (save merchant = walk into ambush)
</think>
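
A fixed note format like this is easy to check mechanically, for example when filtering training data or scoring format adherence. The parser below is a rough sketch under two assumptions that the document does not state: section headers are UPPER_SNAKE_CASE followed by a colon, and a line that is neither a header nor a bullet continues the previous entry. The name `parse_dm_notes` is hypothetical.

```python
import re

# Assumption: headers look like "WORLD_STATE:" or "RULING: some text".
HEADER_RE = re.compile(r"^([A-Z][A-Z_]*):\s*(.*)$")


def parse_dm_notes(notes: str) -> dict[str, list[str]]:
    """Group the lines of a structured <think> block by section header."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in notes.splitlines():
        line = line.strip()
        if not line:
            continue
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            sections[current] = []
            if header.group(2):            # inline value after the colon
                sections[current].append(header.group(2))
        elif current is not None:
            if line.startswith("- "):      # bullet entry under a header
                sections[current].append(line[2:])
            elif sections[current]:        # wrapped line: join to previous entry
                sections[current][-1] += " " + line
            else:
                sections[current].append(line)
    return sections


notes = """\
WORLD_STATE:
- Party location: Room 3 (trapped corridor)
- Active threats: Pressure plate (DC 14), Goblin patrol (2 rounds away)
PLAYER_INTENT: Rogue wants to scout ahead
DRAMATIC_CONSIDERATION: Party is low on resources, good time for
a tough choice (save merchant = walk into ambush)
"""
parsed = parse_dm_notes(notes)
print(sorted(parsed))
```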
Phase 1: SFT on Annotated Sessions
├── Train on high-quality actual play with DM notes
├── Model learns format and basic DMing patterns
└── ~5K sessions, 1 epoch
Phase 2: RL with Shaped Rewards
├── Self-play with player simulator
├── Reward: consistency + no_leaks + narrative_quality
├── ~50K episodes
└── GRPO with group_size=4 (same scenario, different rolls)
Phase 3: Human Feedback (Optional)
├── Real players rate sessions
├── DPO on preferred DM responses
└── Polish style and pacing
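
For Phase 2, the core of GRPO is that each rollout is scored relative to the other rollouts in its group, here four runs of the same scenario with different dice rolls. The snippet below sketches only that normalization step. The function name and the example reward values are made up, and the choice of population standard deviation plus a small epsilon is an assumption, since implementations differ on those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other rollouts of the same scenario."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one scenario (group_size=4); reward values are invented.
group_rewards = [1.8, -3.2, 0.9, 1.3]
advantages = group_relative_advantages(group_rewards)
print([round(a, 2) for a in advantages])
```

A rollout that leaked a secret (the negative reward here) ends up with a strongly negative advantage relative to its siblings, even though they all faced the same scenario.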
- Leak Rate: % of sessions where secrets appear in visible text
- Consistency Score: Do facts stay consistent across a session?
- Player Engagement: Session length, return rate (if deployed)
- Rule Accuracy: Correct application of game mechanics
- Narrative Quality: Human ratings on atmosphere/pacing
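
Of these metrics, leak rate is the simplest to automate. The sketch below assumes each session is available as a list of visible DM turns plus a list of key phrases for secrets that were never revealed. Both function names are hypothetical, and case-insensitive substring matching is an assumption that will miss paraphrased leaks, so treat the result as a lower bound.

```python
def session_leaks(visible_turns: list[str], secret_phrases: list[str]) -> bool:
    """True if any unrevealed secret phrase appears in any visible DM turn."""
    lowered = [turn.lower() for turn in visible_turns]
    return any(p.lower() in turn for p in secret_phrases for turn in lowered)


def leak_rate(sessions: list[tuple[list[str], list[str]]]) -> float:
    """Fraction of sessions with at least one leaked secret phrase."""
    if not sessions:
        return 0.0
    leaked = sum(session_leaks(turns, secrets) for turns, secrets in sessions)
    return leaked / len(sessions)


sessions = [
    (["The cart sits lower on its axles than silk alone would explain."],
     ["obsidian dagger"]),
    (["You find an Obsidian Dagger taped under the seat."],
     ["obsidian dagger"]),
]
print(leak_rate(sessions))
```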
scenario = Scenario(
    setting="Abandoned manor, stormy night",
    secrets=[
        Secret("The butler is a vampire", reveal_condition="direct sunlight or detect evil"),
        Secret("Treasure is in the hidden basement", reveal_condition="find switch behind painting"),
        Secret("Ghost of Lady Ashworth wants revenge on her killer (the butler)",
               reveal_condition="speak with dead or find her diary"),
    ],
    npcs=[
        NPC("Butler Jenkins", visible="elderly, helpful", hidden="vampire, killed Lady Ashworth"),
        NPC("Groundskeeper", visible="drunk, scared", hidden="knows about basement, too afraid to tell"),
    ],
    map=load_map("cursed_manor.json"),
)

- Planning: DM can reason about pacing and dramatic timing
- State Management: Structured thinking maintains complex world state
- Secret Keeping: Clear separation between private and public
- Consistency: Model can reference its own notes across turns
- Improvisation: Can reason about unexpected player actions
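
The separation between private and public information also shows up in the data model. The `Scenario`, `Secret`, and `NPC` types used in the manor example are not defined in this document, so the dataclasses below are only one plausible shape for them. The field names follow the example's keyword arguments, the `revealed` flag and the `unrevealed_secrets` property are assumptions, and the `map` and `initial_state` fields are left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Secret:
    description: str
    reveal_condition: str
    revealed: bool = False   # assumption: flipped once players meet the condition


@dataclass
class NPC:
    name: str
    visible: str   # what players can observe
    hidden: str    # DM-only facts, never shown directly


@dataclass
class Scenario:
    setting: str
    secrets: list[Secret] = field(default_factory=list)
    npcs: list[NPC] = field(default_factory=list)

    @property
    def unrevealed_secrets(self) -> list[Secret]:
        """Secrets that must still stay out of the visible narration."""
        return [s for s in self.secrets if not s.revealed]


manor = Scenario(
    setting="Abandoned manor, stormy night",
    secrets=[
        Secret("The butler is a vampire",
               reveal_condition="direct sunlight or detect evil"),
        Secret("Treasure is in the hidden basement",
               reveal_condition="find switch behind painting"),
    ],
    npcs=[NPC("Butler Jenkins", visible="elderly, helpful",
              hidden="vampire, killed Lady Ashworth")],
)
manor.secrets[1].revealed = True   # players found the switch
print(len(manor.unrevealed_secrets))
```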
The DM role is well suited to thinking models because it requires holding information the "audience" (the players) shouldn't see, which is exactly the separation the <think> block provides.