@turlockmike
Last active March 19, 2026 04:37
BMAD Product Brief skill rewritten as outcome-driven single SKILL.md
---
name: product-brief
description: Create executive product briefs through collaborative discovery,
artifact analysis, and web research. Use when asked to create, update, or
review a product brief.
---
# Product Brief
Create a 1-2 page executive product brief. You are a product-focused BA
and peer collaborator. The user is the domain expert. You bring structured
thinking, market awareness, and synthesis.
## Outcomes
1. **Intent is clear** — You know what the product is, what problem it solves,
and why the user is creating this brief (new vs update). If updating, read
the existing brief first.
2. **Context is gathered** — Scan any documents the user provides or points to.
Run 3-5 web searches for competitive landscape, market trends, and user
sentiment. Identify what you know, what's missing, and what's surprising.
3. **Gaps are filled** — Through targeted conversation (not a questionnaire),
cover: problem & vision, users & value, market & differentiation, success
& scope. Lead with what you already know, ask only what you don't.
Capture out-of-scope detail (requirements, technical constraints) silently
for the distillate.
4. **Draft is reviewed** — Write the brief, then self-review through three lenses:
- **Skeptic**: What's missing? Untested assumptions? Unacknowledged risks?
- **Opportunity**: Untapped value? Underemphasized strengths? Growth angles?
- **Contextual**: Pick the lens that addresses this product's biggest risk
(regulatory, adoption friction, go-to-market, network effects, etc.)
Integrate non-controversial improvements. Flag substantive choices for user.
5. **Brief is finalized** — Polished 1-2 page executive summary. Offer a
distillate (dense bullets: rejected ideas, requirements hints, technical
context, competitive intel, open questions, scope signals) for downstream
PRD creation.
## Brief Structure (adapt to fit the product)
- Executive Summary (standalone-compelling)
- The Problem (specific scenarios, real frustrations)
- The Solution (experience and outcome, not implementation)
- What Makes This Different (honest differentiators)
- Who This Serves (primary users, vivid but brief)
- Success Criteria (measurable)
- Scope (in/out for v1)
- Vision (2-3 years, inspiring but grounded)
## Modes
- **Guided** (default): Conversational discovery with soft gates
- **Yolo** (`--yolo`): Ingest everything, draft upfront, refine after
- **Autonomous** (`--autonomous`): No interaction, produce complete brief
## Writing Principles
Executive audience. Lead with the problem. Concrete over abstract.
Confident voice. 1-2 pages — detail overflow goes in the distillate.

Multi-Turn Conversational Agent Evaluation: State of the Art

Researched: 2026-03-18

Summary

Evaluating multi-turn conversational agents is a genuinely hard problem with no consensus solution — but the field has moved fast in 2024-2025. The core difficulty: single-turn metrics measure whether the final answer is right, but interactive agents can produce the right answer via the wrong path, ask too many or too few clarifying questions, lose context mid-conversation, or fail only when conversations hit three or more turns. None of that is visible in a one-shot eval.

Three broad methodologies dominate: (1) static benchmarks with pre-written multi-turn dialogues judged by an LLM, (2) live simulation where an LLM role-plays the user and the agent under test plays the assistant, and (3) trajectory evaluation where the agent's path — not just its endpoint — is scored against an expected reference. The simulated-user approach is now widely used, but a multi-country study published in January 2026 found it systematically miscalibrated and demographically biased compared to real users. The field knows this and is working around it, but simulated users remain the dominant cost-efficient option.

Open-source frameworks are real and usable: tau-bench (Sierra), DeepEval (Confident AI), Ragas, and Langfuse all ship multi-turn eval primitives today. Anthropic's Bloom is the most ambitious automated framework — fully agentic, generates its own test scenarios, uses LLM-as-user and LLM-as-judge. AgentBoard (NeurIPS 2024 Oral) provides the most principled trajectory metric (Progress Rate) tied to subgoal completion.

Key Findings

  1. No single method is sufficient. Every practitioner survey and framework recommends layering: static golden sets for regression, simulated users for breadth, human evaluation for ground-truth calibration.

  2. Simulated users are cheap but biased. The "Lost in Simulation" paper (arXiv 2601.17087, Jan 2026) measured a 9-percentage-point success-rate variance just from swapping the user LLM model. Simulated users over-ask questions, over-apologize, and disproportionately blame the agent. Results degrade significantly for non-SAE (Standard American English) speakers.

  3. Pass^k is now the standard reliability metric for agentic tasks. Tau-bench introduced it: run the same task k times, measure how often the agent succeeds consistently. GPT-4o achieves pass^8 < 25% on retail tasks — far worse than one-shot success rates suggest.

  4. Trajectory evaluation (Progress Rate) catches failures that endpoint evaluation misses. AgentBoard's Progress Rate compares the agent's actual intermediate steps against expected subgoals. An agent can reach the right final state via a lucky wrong path — endpoint-only metrics miss this.

  5. MT-Bench's 2-turn structure is a floor, not a ceiling. The original MT-Bench (2023) used exactly 2 turns. MT-Bench-101 (ACL 2024) extended this to a 3-level taxonomy; MT-Eval (EMNLP 2024) identified four interaction patterns (recollection, expansion, refinement, follow-up) and found error propagation across turns is the dominant failure mode.

  6. The conversation graph / flowgraph approach (Zendesk) is underused. Map the procedure to a directed flowgraph of possible paths, sample trajectories via weighted random walk, synthesize user and agent turns per path. This grounds tests in defined behavior, tests all decision branches, and is immune to many LLM-simulation biases.

  7. LLM-as-judge works best pairwise, not pointwise. The original MT-Bench paper established 80%+ agreement with human experts for pairwise comparison. Direct scoring (1-10) is less reliable and more prone to verbosity bias; pairwise runs must still control for position bias by swapping response order.

  8. Turn-level and conversation-level metrics measure different things. DeepEval's framework makes this explicit: turn-level metrics (TurnRelevancyMetric, KnowledgeRetentionMetric) catch local failures; conversation-level metrics (ConversationCompletenessMetric) catch whether the agent fulfilled the user's intent across the whole arc.

  9. The "when to stop asking questions" problem is mostly unsolved. No benchmark directly measures over-clarification (asking too many questions) vs. under-clarification (proceeding with too little information). This is an open gap.


Details

1. Static Benchmarks with LLM-as-Judge

MT-Bench (2023) / MT-Bench-101 (ACL 2024) / MT-Eval (EMNLP 2024)

Origin: MT-Bench was introduced by the LMSYS group (Zheng et al., 2023) as part of the "Judging LLM-as-a-Judge" paper. MT-Bench-101 (Bai et al., ACL 2024) extended it to a 3-level skill taxonomy. MT-Eval (Kwan et al., EMNLP 2024) took a different angle — analyzing human-LLM conversation logs to derive four structural interaction types.

How it works:

  • MT-Bench: 80 high-quality 2-turn questions across 8 categories (writing, roleplay, extraction, reasoning, math, coding, STEM, humanities). GPT-4 acts as judge. Models scored by pairwise comparison or 1-10 scale.
  • MT-Bench-101: Larger dataset with fine-grained category taxonomy; examines subtle behavioral differences at each dialogue stage.
  • MT-Eval: Constructs 4-type taxonomy — recollection (recall earlier context), expansion (extend prior answer), refinement (correct/improve), follow-up (follow on a topic). Finds distance to relevant context and error propagation are the key predictors of failure.

What it measures: Multi-turn instruction following, context maintenance, reasoning consistency across turns.

Trade-offs: Static dataset — tests known failure modes, not emergent ones. 2-turn structure in original MT-Bench is shallow. Expensive to maintain human-rated ground truth.

LLM judge agreement: GPT-4 as judge achieves >80% agreement with human expert pairwise evaluations, comparable to inter-human agreement.


LMSYS Chatbot Arena

How it works: Live head-to-head model comparisons voted on by real users. Not a controlled benchmark — emergent natural conversations, crowd-voted preference.

What it measures: Real-world user preference, including naturalness, helpfulness, and conversational flow.

Trade-offs: Ground truth comes from real humans, but there is no task specificity, no trajectory analysis, and no automated scale. The gold standard for "do real users prefer it" but not for diagnosing specific failures.


2. Simulated User Frameworks

tau-bench (Sierra Research, 2024)

Paper: arXiv 2406.12045 — "A Benchmark for Tool-Agent-User Interaction in Real-World Domains"

How it works:

  • LLM simulates a user in a domain-specific customer service scenario (retail, airline)
  • Agent under test must use provided API tools and follow business policy
  • Evaluation: compare database state at end of conversation against annotated goal state
  • New metric: pass^k — run the same task k times, measure fraction of runs where agent succeeds consistently
  • pass^k rewards reliability, not lucky single-run success

Key result: GPT-4o achieves pass^8 < 25% on retail tasks despite >50% single-run task success. The inconsistency gap is the finding.
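The pass^k estimate can be computed directly from repeated trials. A minimal sketch using the standard combinatorial estimator C(c,k)/C(n,k) averaged across tasks; the run data here is invented for illustration, and this is not tau-bench's actual harness:

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Unbiased pass^k estimate for one task: the probability that k
    draws without replacement from the observed trials all succeed."""
    if k > trials:
        raise ValueError("k cannot exceed the number of trials")
    return comb(successes, k) / comb(trials, k)

def benchmark_pass_k(results: dict[str, list[bool]], k: int) -> float:
    """Average pass^k across tasks; results maps task id -> per-trial outcomes."""
    per_task = [pass_hat_k(sum(r), len(r), k) for r in results.values()]
    return sum(per_task) / len(per_task)

# A flaky agent: 75% single-run success, but it never goes 8-for-8.
runs = {"retail-1": [True] * 6 + [False] * 2}
print(benchmark_pass_k(runs, 1))  # 0.75
print(benchmark_pass_k(runs, 8))  # 0.0
```

The gap between k=1 and k=8 is exactly the inconsistency that tau-bench's key result highlights: one-shot success rates overstate reliability.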

tau2-bench (2025): Extends to "dual-control environment" — both agent and user simulator have genuine agency (no scripted paths). Adds telecom and banking domains. tau3-bench adds voice full-duplex evaluation.

Open source: https://github.com/sierra-research/tau-bench / tau2-bench


DeepEval Conversation Simulator (Confident AI)

How it works:

  • Define ConversationalGolden objects: scenario description, expected outcome, user description, optional pre-seeded turns
  • ConversationSimulator runs an LLM (default GPT-4.1) as user, alternates turns with your agent until expected outcome is reached or max_user_simulations cycles hit
  • Produces ConversationalTestCase objects ready for metric evaluation
  • Configurable: simulator_model, async_mode, max_concurrent (default 100), max_user_simulations (default 10)

Metric categories:

  • Conversation-level: ConversationCompletenessMetric (user intent satisfied?), ConversationalGEval (custom natural-language criteria)
  • Turn-level: TurnRelevancyMetric, KnowledgeRetentionMetric, RoleAdherenceMetric
  • Multi-turn RAG: TurnFaithfulnessMetric, TurnContextualRelevancyMetric, TurnContextualPrecisionMetric, TurnContextualRecallMetric

Key design principle: Scenarios describe situations, not scripted messages — generates non-deterministic but goal-oriented conversations.

Docs: https://deepeval.com/docs/conversation-simulator
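Framework specifics aside, the simulate-until-outcome loop these tools implement is simple. A framework-agnostic sketch — the dataclass, the three callables, and the toy stand-ins are hypothetical placeholders for LLM-backed implementations, not DeepEval's API:

```python
from dataclasses import dataclass, field

@dataclass
class SimulatedConversation:
    scenario: str          # a situation description, not scripted messages
    expected_outcome: str
    turns: list = field(default_factory=list)

def simulate(golden, agent_fn, user_fn, outcome_fn, max_user_turns=10):
    """Alternate a simulated user (user_fn) with the agent under test
    (agent_fn) until outcome_fn says the expected outcome is reached
    or the turn budget runs out."""
    convo = SimulatedConversation(golden["scenario"], golden["expected_outcome"])
    for _ in range(max_user_turns):
        convo.turns.append({"role": "user", "content": user_fn(convo)})
        convo.turns.append({"role": "assistant", "content": agent_fn(convo)})
        if outcome_fn(convo):  # e.g. an LLM judge or an end-state check
            break
    return convo

# Toy stand-ins so the loop is runnable end to end.
golden = {"scenario": "User wants to reset their password",
          "expected_outcome": "reset link sent"}
convo = simulate(
    golden,
    agent_fn=lambda c: "reset link sent" if len(c.turns) >= 3 else "which email?",
    user_fn=lambda c: "I forgot my password" if not c.turns else "[email protected]",
    outcome_fn=lambda c: c.turns[-1]["content"] == "reset link sent",
)
print(len(convo.turns))  # 4 — two user/assistant cycles before the goal is hit
```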


Langfuse + OpenEvals Simulation

How it works:

  • User simulated via OpenEvals with composite persona/scenario prompt: "You are a user in the following situation: [scenario]. You have these characteristics: [persona]."
  • Agent under test runs as normally instrumented; Langfuse SDK traces all interactions
  • Experiment runner iterates dataset of persona/scenario pairs, runs full conversations, collects turn counts and message histories
  • LLM-as-judge scores specific criteria (e.g., Conciseness) on resulting conversations
  • Dataset-level aggregate metrics track improvement over iterations

Docs: https://langfuse.com/guides/cookbook/example_simulated_multi_turn_conversations


Anthropic Bloom (2024)

Source: https://www.anthropic.com/research/bloom — Open source

How it works: Fully agentic, 4-stage pipeline for behavioral evaluation:

  1. Understanding — Analyzes researcher-specified behavior description and example transcripts
  2. Ideation — Generates diverse evaluation scenarios (specifies situation parameters, simulated user characteristics, system prompts, interaction environments) targeted to elicit the behavior being studied
  3. Rollout — Executes scenarios in parallel; LLMs dynamically simulate both user responses and tool interactions
  4. Judgment — Judge model scores transcripts for target behavior presence; meta-judge produces suite-level analysis

Key distinction from static benchmarks: Bloom generates different scenarios on every run (via configurable seeds) while measuring the same underlying behavior. Purpose-built for measuring behavioral tendencies (misaligned behaviors, refusal rates, etc.) not just task success.

Output metrics: Elicitation rates (behavior frequency and severity) across automatically generated test cases.

Integration: Weights & Biases for large-scale experiments; exports Inspect-compatible transcripts.


Ragas AspectCritic + AgentGoalAccuracy

How it works:

  • Conversations structured as MultiTurnSample with sequential HumanMessage/AIMessage exchanges
  • AspectCritic evaluates entire conversation against natural-language criteria (free-form) and returns binary pass/fail
  • AgentGoalAccuracy evaluates whether agent achieved the user's goal (with or without a reference goal)
  • Integrates with LlamaIndex agent events (AgentInput, AgentOutput) for full tracing

Strengths: Binary outcome is unambiguous. Natural-language criteria definition allows domain-specific rules without code. Pairwise comparison available.

Docs: https://docs.ragas.io/en/stable/howtos/applications/evaluating_multi_turn_conversations/


3. Trajectory Evaluation

AgentBoard Progress Rate (NeurIPS 2024 Oral)

Paper: Ma et al., NeurIPS 2024 Datasets & Benchmarks Track — "AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents"

GitHub: https://github.com/hkust-nlp/AgentBoard

How it works:

  • Agent operates in partially-observable environments, must reach subgoals across multi-round interactions
  • Progress Rate: compares agent's actual intermediate trajectory against expected subgoal sequence; gives partial credit for partial progress, not binary pass/fail
  • Validated via Pearson correlation with human evaluations on 60 trajectories per task; three models tested (GPT-4, GPT-3.5-Turbo, DeepSeek-67b)
  • Step Success Rate (Gioacchini et al., 2024): percentage of steps in generated plan that successfully execute — holistic view of planning quality during execution

Why this matters: An agent can reach the correct final state via a wrong/lucky path. Endpoint-only metrics miss this. Progress Rate catches it. Also catches agents that are "mostly right" but fail at the last step — binary success masks near-success.
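A stripped-down illustration of the partial-credit idea, matching expected subgoals in order. The real Progress Rate matches environment states, not strings, and the subgoal names below are invented:

```python
def progress_rate(trajectory: list[str], subgoals: list[str]) -> float:
    """Fraction of expected subgoals completed, in order. An agent that
    reaches 3 of 4 subgoals scores 0.75 where a binary endpoint metric
    would score 0 — near-success is visible instead of masked."""
    i = 0
    for step in trajectory:
        if i < len(subgoals) and step == subgoals[i]:
            i += 1
    return i / len(subgoals)

goals = ["find_item", "add_to_cart", "apply_discount", "checkout"]
print(progress_rate(["find_item", "add_to_cart", "apply_discount"], goals))  # 0.75
print(progress_rate(["find_item", "browse", "add_to_cart"], goals))          # 0.5
```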

Interactive visualization: Dashboard for multi-faceted analysis of trajectory data.


T-Eval Reasoning Metric

How it works: At each step, measures how closely the agent's predicted next tool call aligns with the expected one — before tool outputs are known. Assesses whether the agent's decision logic is correct, not just whether it got lucky.

What this catches: Agents that make correct decisions for wrong reasons; agents whose reasoning degrades over turns even when outcomes are correct.


Zendesk ALMA / Conversation Graph Approach

Source: Arcadinho et al. (Zendesk Research) — "Automated test generation to evaluate tool-augmented LLMs as conversational AI agents"

Benchmark: ALMA — 1,420 manually curated conversations, covering tool use and full multi-turn support conversations.

How conversation graph testing works:

  1. Map procedure to a directed flowgraph: nodes are decision points, edges are possible agent/user actions
  2. Include normal paths, decision branches, dead-ends, and detours
  3. Sample multiple trajectories via weighted random walk
  4. Synthesize full dialogues: LLM generates user and agent turns per sampled path
  5. Inject interruptions, clarifying questions, context shifts to test agent resilience

Key finding: Models handle individual tool calls reliably but fail frequently once conversations involve multiple turns, clarifications, or interruptions.

What the graph approach buys: Tests are grounded in defined procedures (reducing hallucinations about what "correct" means), all decision branches are covered systematically, and sampling variance is controlled.
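Steps 1-3 above fit in a few lines. The refund flowgraph below is invented for illustration; each sampled path would then be handed to an LLM to synthesize the actual user and agent turns:

```python
import random

# Toy flowgraph for a refund procedure: node -> [(next_node, weight)]
FLOW = {
    "start":           [("verify_identity", 1.0)],
    "verify_identity": [("lookup_order", 0.8), ("escalate", 0.2)],
    "lookup_order":    [("issue_refund", 0.6), ("clarify_item", 0.4)],
    "clarify_item":    [("issue_refund", 1.0)],
    "issue_refund":    [("end", 1.0)],
    "escalate":        [("end", 1.0)],
}

def sample_trajectory(graph, start="start", end="end"):
    """Weighted random walk from start to end; each path becomes the
    skeleton for one synthesized test dialogue."""
    path, node = [start], start
    while node != end:
        nexts, weights = zip(*graph[node])
        node = random.choices(nexts, weights=weights)[0]
        path.append(node)
    return path

random.seed(0)
for _ in range(3):
    print(" -> ".join(sample_trajectory(FLOW)))
```

Sampling enough walks covers every decision branch, including the low-weight escalation path that a handful of hand-written test dialogues would likely miss.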


4. LLM-as-Judge Methodology

Core Approach (MT-Bench / Zheng et al., 2023)

Paper: arXiv 2306.05685 — "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"

How it works:

  • Pairwise comparison: present judge LLM with two responses to same prompt, ask which is better and why
  • Point-scale scoring: ask judge to rate on 1-10 per dimension
  • Pairwise is more reliable — comparison forces the judge to discriminate, while direct point-scale scoring is noisier and more prone to verbosity bias; pairwise itself is subject to position bias, mitigated by swapping response order

Known biases:

  • Position bias: Judge prefers responses in first position regardless of quality
  • Verbosity bias: Judge prefers longer responses even when less correct
  • Self-enhancement bias: A model judging its own outputs rates them higher
  • Mitigation: swap position of responses across multiple judge runs; average results; use explanation-first prompting

For multi-turn specifically: Judge receives full conversation history, not just final turn. Prompting the judge to explain ratings before scoring improves alignment with human judgment significantly.
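The position-swap mitigation is mechanical. A toy sketch with a deliberately position-biased stand-in judge — both functions are illustrative, not the MT-Bench implementation:

```python
def biased_toy_judge(prompt: str, first: str, second: str) -> str:
    """Stand-in for an LLM judge with position bias: prefers the
    clearly longer response, and breaks near-ties toward position one."""
    if len(first) > 1.2 * len(second):
        return "first"
    if len(second) > 1.2 * len(first):
        return "second"
    return "first"  # position bias: near-ties go to position one

def debiased_compare(judge, prompt: str, a: str, b: str) -> str:
    """Judge both orderings; only declare a winner when the two
    verdicts agree, otherwise call it a tie."""
    v1 = judge(prompt, a, b)  # a in position one
    v2 = judge(prompt, b, a)  # b in position one
    if v1 == "first" and v2 == "second":
        return "a"
    if v1 == "second" and v2 == "first":
        return "b"
    return "tie"

# Near-equal responses: the biased judge picks position one both times,
# and the swap exposes the inconsistency as a tie.
print(debiased_compare(biased_toy_judge, "q", "short answer", "short reply"))  # tie
```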


Multiple LLM Judges Aggregation (2025)

Paper: arXiv 2508.00454 — "Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple LLM Judges"

How it works:

  • Construct large-scale pairwise preference datasets for multi-turn dialogues, each annotated by multiple advanced LLM judges
  • Train a single small model that captures the collective wisdom of multiple judges
  • Reduces cost: run the small trained evaluator at inference time, not GPT-4 on every sample
  • Aggregation methods: max voting or averaging across judges

Value: Compresses multiple expensive judges into one fast, calibrated evaluator.
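The max-voting aggregation step is straightforward. A sketch of the label-construction side only — the paper's contribution is training a small evaluator on such labels, and the verdict strings here are illustrative:

```python
from collections import Counter

def aggregate_judges(verdicts: list[str]) -> str:
    """Max-voting across judge verdicts for one dialogue pair;
    an exact tie between the top verdicts falls back to 'tie'."""
    top = Counter(verdicts).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "tie"
    return top[0][0]

print(aggregate_judges(["a", "a", "b"]))  # a
print(aggregate_judges(["a", "b"]))       # tie
```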


Mixture of Prompts (MoPs) Framework

How it works: Dynamically selects specialized evaluation prompt modules based on input characteristics. Different prompt templates for different task types — factual, creative, reasoning, multi-turn conversational. Better adaptability across heterogeneous tasks.


5. Memory and Long-Term Conversation Evaluation

LoCoMo Framework

How it works: Evaluates agent performance across 32 sessions, ~600 turns, 16K-token dialogues spanning simulated months. Tests factual recall, temporal reasoning, persona consistency.

What this measures: Not just "does the agent remember X from 3 turns ago" but "does the agent remember X from session 12 and apply it correctly in session 28."


LongEval

How it works: 40+ utterance assessment framework. Tests short-term conversation coherence under extended contexts. Complementary to LoCoMo (within-session length vs. across-session persistence).


6. The "Lost in Simulation" Problem

Paper: arXiv 2601.17087 — "Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations" (Jan 2026)

Study design: Real users in US, India, Kenya, Nigeria interacting with AI agents. Compared outcomes to LLM-simulated users on same tasks.

Specific failure modes identified:

  • Success rate variance of ~9 percentage points just from swapping the user LLM model (robustness failure)
  • Systematic underestimation of agent performance on challenging tasks (validity failure)
  • Simulated users ask questions more frequently than real users
  • Simulated users are excessively polite
  • Simulated users assign blame to agent at 48.9% vs. real users at 24.5%
  • Demographic fairness: AAVE speakers experienced 39.4% vs. 50.6% SAE success rates in simulation; gap widens dramatically with age (19 pp difference at 55+)

Recommendations:

  • Test robustness by running simulations with multiple different user LLMs and checking result variance
  • Validate at least a sample of simulated results against demographically diverse real users
  • Be explicit about simulation limitations in reported metrics
  • Never optimize solely for simulated-user performance — it risks building systems that work for the simulation, not real people
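The first recommendation is easy to automate. A sketch of a robustness check across user-simulator models; the threshold, model names, and success rates are invented:

```python
from statistics import mean, pstdev

def robustness_report(success_by_user_model: dict[str, float],
                      max_spread: float = 0.05) -> dict:
    """Flag simulation results whose success rate swings too much when
    the user-simulator LLM is swapped (the 'Lost in Simulation' failure)."""
    rates = list(success_by_user_model.values())
    spread = round(max(rates) - min(rates), 6)
    return {
        "mean": mean(rates),
        "stdev": pstdev(rates),
        "spread": spread,
        "robust": spread <= max_spread,
    }

# A ~9 pp swing from the user model alone, as the paper observed.
report = robustness_report({"model-a": 0.62, "model-b": 0.53, "model-c": 0.58})
print(report["spread"], report["robust"])  # 0.09 False
```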

7. Practical Evaluation Stacks

Google Cloud "Methodical Approach" Framework

Source: https://cloud.google.com/blog/topics/developers-practitioners/a-methodical-approach-to-agent-evaluation

Core principle: "Metrics focused only on the final output are no longer enough for systems that make a sequence of decisions." Silent failures — correct output, wrong process — are the key risk.

Recommended layering:

  1. Human evaluation establishes ground truth for known failure modes
  2. LLM-as-user generates diverse multi-turn test data at scale
  3. Automated eval runs as quality gate on every code change — build fails if scores drop below threshold
  4. Continuous monitoring catches drift in production

DeepEval + Confident AI Production Loop

Architecture:

  1. Define scenarios as ConversationalGolden objects
  2. Simulate conversations with ConversationSimulator
  3. Run conversation-level + turn-level metrics
  4. Log real production conversations as threads to Confident AI
  5. Failing production conversations feed back into development datasets

This closes the loop: simulation catches failure modes in development; production monitoring catches failure modes that weren't anticipated.


Sources

Cross-References

  • doc/research/llm-instruction-following.md — related: instruction following across turns
  • doc/research/llm-memory-organization.md — related: memory architecture relevant to long-context conversation eval

Open Questions

  1. The clarification-count problem: No benchmark specifically measures whether an agent asks the right number of clarifying questions — neither too many (annoying) nor too few (proceeding with ambiguity). AgentBoard and tau-bench measure task success but not conversation efficiency. This is an active gap.

  2. Demographic validity of simulation: The "Lost in Simulation" findings are damning, but the paper's recommendations are vague. What specific techniques actually close the calibration gap between simulated and real users? Open research question as of early 2026.

  3. Trajectory ground-truth cost: AgentBoard's Progress Rate requires annotating expected subgoal sequences per task. For custom domains, this annotation cost may be prohibitive. No current framework automates subgoal annotation from task descriptions.

  4. Cross-session vs. within-session eval: Most frameworks test within a single conversation session. LoCoMo tests across sessions but is not integrated with the production-loop frameworks (DeepEval, Ragas, Langfuse). No off-the-shelf tool covers both.

  5. LLM judge calibration drift: As judge models improve or change versions, historical scores become incomparable. No standard approach for version-locking or calibrating judge drift over time.

  6. When to stop asking questions: The optimal number of clarifying questions before proceeding depends on task domain, stakes, and user tolerance. No eval framework models user tolerance or penalizes over-clarification explicitly.
