@ehartford
Last active October 7, 2025 00:19
CAD-B: Confidence-Aware Decision Benchmark

Universal evaluation of uncertainty-guided adaptive behavior via prompting + logprobs


Overview

Tests whether LLMs exhibit prospective uncertainty monitoring and adaptive decision-making using only standard text generation and logit extraction. No custom interfaces required. Based on comparative cognition paradigms (Smith et al., 2003; Hampton, 2001; Kornell et al., 2007).


Confidence Extraction from Logprobs

Token probability: $c_{\text{token}} = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$ where $z_i$ = logit for chosen token

Sequence probability: $c_{\text{seq}} = \exp\left(\frac{1}{T}\sum_{t=1}^T \log P(y_t \mid y_{<t}, x)\right)$

Verbalized: Prompt "Confidence (0-100):" → parse numerical response
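The two logprob-based definitions above can be sketched in a few lines; the function names here are illustrative, not part of any benchmark API:

```python
import math

def token_confidence(logits: dict[str, float], chosen: str) -> float:
    """Softmax over the candidate tokens' logits; return P(chosen token)."""
    denom = sum(math.exp(z) for z in logits.values())
    return math.exp(logits[chosen]) / denom

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Length-normalized sequence probability: exp(mean of log P(y_t | y_<t, x))."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```

Length normalization (dividing by $T$) keeps $c_{\text{seq}}$ comparable across answers of different lengths.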


Test 1: Graded Information Seeking (GIS)

Prompt: Present question with choices $\{$answer now, request context $C_i$ (cost $\tau_i$), pass (cost $\epsilon$)$\}$, stakes $\lambda \in \{1,5,10\}$

Extract: $P(\text{choice}_i)$ from token logprobs at decision point

Metrics: $$\text{ISA} = \mathbb{E}_{\text{bin}}[\rho(\mathbb{1}_{\text{seek}}, \mathbb{1}_{\text{error}})] \quad \text{target: } > 0.6$$

$$\text{CBR} = \text{Acc}(\beta_0 + \beta_d d + \beta_\lambda \lambda + \beta_\tau \tau) \quad \text{target: } > 0.7$$

$$\text{ECE} = \sum_{b=1}^{10} \frac{|B_b|}{N} |\bar{c}_b - \bar{a}_b| \quad \text{target: } < 0.15$$

Dataset: 5K questions, $d \in [0,1]$ difficulty, trivia/science/compositional domains
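The ECE formula above uses 10 equal-width confidence bins; a minimal reference implementation (names are illustrative):

```python
def expected_calibration_error(confs: list[float], correct: list[int], n_bins: int = 10) -> float:
    """ECE = sum_b |B_b|/N * |mean confidence - mean accuracy| over equal-width bins."""
    n = len(confs)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Right-closed final bin so confidence 1.0 is counted.
        idx = [i for i, c in enumerate(confs)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        avg_conf = sum(confs[i] for i in idx) / len(idx)
        avg_acc = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - avg_acc)
    return ece
```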


Test 2: Adaptive Processing Depth (APD)

Prompt (2-stage):
(1) "Question: {q}\nAnswer:" → measure effort $e$ (tokens, latency, entropy)
(2) "Confidence (0-100):" → extract $c_{\text{verbal}}$

Metrics: $$\rho_{\text{PE}} = \text{corr}(e_i, \mathbb{1}_{\text{error}_i} \mid d_i) \quad \text{target: } > 0.4$$

$$\rho_{\text{conf}} = \text{corr}(c_i, \mathbb{1}_{\text{correct}_i}) \quad \text{target: } > 0.5$$
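Both APD metrics are Pearson correlations against a binary outcome (i.e., point-biserial correlations); a self-contained sketch, with $\rho_{\text{PE}}$ computed within each difficulty stratum before averaging (stratum handling here is an assumption about how the conditioning on $d_i$ is realized):

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; with a binary ys this is the point-biserial correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def rho_pe(effort: list[float], error: list[int], difficulty: list[float]) -> float:
    """Mean within-stratum corr(effort, error), conditioning on difficulty."""
    strata = {}
    for e, err, d in zip(effort, error, difficulty):
        strata.setdefault(d, ([], []))
        strata[d][0].append(e)
        strata[d][1].append(err)
    rs = [pearson(es, errs) for es, errs in strata.values() if len(es) > 1]
    return sum(rs) / len(rs)
```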


Test 3: Dynamic Wagering (DWU)

Prompt: Answer question, then forced choice wagering:

[A] Very uncertain (1 pt)  [B] Uncertain (5 pts)  [C] Moderate (10 pts)
[D] Confident (50 pts)     [E] Very confident (100 pts)

Extract: $P(\text{A}), \ldots, P(\text{E})$ from logprobs; wager $w \in \{1,5,10,50,100\}$

Metrics (3 conditions: memorized, inferred, compositional): $$\rho_{\text{bet}} = \text{corr}(w_i, \mathbb{1}_{\text{correct}_i}) \quad \text{target: } > 0.5$$

$$R_{\text{transfer}} = \frac{\text{ECE}_{\text{comp}}}{\text{ECE}_{\text{mem}}} \quad \text{target: } < 2.5$$
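One way to turn the five option logprobs into a continuous wager is to take the probability-weighted expectation over the point values; this expected-wager reduction is an assumption, not mandated by the spec (argmax over options also works):

```python
import math

# Point values attached to the forced-choice wagering options.
WAGERS = {"A": 1, "B": 5, "C": 10, "D": 50, "E": 100}

def expected_wager(option_logprobs: dict[str, float]) -> float:
    """Softmax-normalize the option logprobs and take the expected point value."""
    probs = {k: math.exp(v) for k, v in option_logprobs.items()}
    z = sum(probs.values())
    return sum(WAGERS[k] * p / z for k, p in probs.items())
```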


Test 4: Opt-Out Variable Stakes (OVS)

Prompt: Binary choice [A] Answer yourself [B] Consult expert (cost: $\epsilon$)

Extract: $P(\text{opt-out}) = P(\text{B})$ from logprobs

Logistic regression: $$\text{logit}(P_{\text{opt}}) = \beta_d d + \beta_\lambda \lambda + \beta_e \mathbb{1}_e + \beta_{d\lambda}\, d \cdot \lambda$$

Target: $\beta_\lambda / \text{SE}_\lambda > 3.0$, $\beta_{d\lambda} / \text{SE}_{d\lambda} > 2.0$ (interaction effect)

Design: Difficulty $\{$easy, hard$\} \times$ stakes $\{1, 10, 100\} \times$ expert $\{$yes, no$\}$
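The regression's design matrix (including the difficulty-by-stakes interaction term) and its prediction can be sketched as follows; fitting the $\beta$ coefficients would be done with any standard logistic-regression routine:

```python
import math

def design_row(d: float, lam: float, expert: bool) -> list[float]:
    """Features for logit(P_opt) = b_d*d + b_lam*lam + b_e*1_e + b_dlam*d*lam."""
    return [d, lam, 1.0 if expert else 0.0, d * lam]

def predict_opt_out(betas: list[float], d: float, lam: float, expert: bool) -> float:
    """Sigmoid of the linear predictor (no intercept, matching the spec above)."""
    z = sum(b * x for b, x in zip(betas, design_row(d, lam, expert)))
    return 1.0 / (1.0 + math.exp(-z))
```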


Test 5: Metacognitive Transfer (MT)

Training: Domain A (math), 100 problems with calibration feedback

Test: Domains B/C/D (medicine, law, history), no feedback

Prompt: Generate answer, then "Confidence (0-100):" → parse $c_{\text{verbal}}$

Metrics: $$R_{\text{transfer}} = \frac{\text{ECE}_{\text{new}}}{\text{ECE}_{\text{trained}}} \quad \text{target: } < 2.0$$

$$\Delta_{\text{improve}} = \frac{\text{ECE}_{\text{base}} - \text{ECE}_{\text{transfer}}}{\text{ECE}_{\text{base}}} \quad \text{target: } > 0.3$$
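Both MT metrics are simple ratios over per-domain ECE values; a direct transcription (names are illustrative):

```python
def transfer_ratio(ece_new: float, ece_trained: float) -> float:
    """R_transfer = ECE_new / ECE_trained; below 2.0 means calibration mostly transfers."""
    return ece_new / ece_trained

def calibration_improvement(ece_base: float, ece_transfer: float) -> float:
    """Delta_improve: relative ECE reduction vs. the no-feedback baseline."""
    return (ece_base - ece_transfer) / ece_base
```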


Composite Score

$$\text{CAD-B} = 30 \cdot S_{\text{GIS}} + 15 \cdot S_{\text{APD}} + 20 \cdot S_{\text{DWU}} + 25 \cdot S_{\text{OVS}} + 10 \cdot S_{\text{MT}}$$

where $S_i \in [0,1]$ normalizes each test's metrics. Final score $\in [0, 100]$.

Interpretation: $\geq 70$ strong, $60\text{–}70$ moderate, $50\text{–}60$ limited, $< 50$ poor
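The composite is a weighted sum of the normalized per-test scores:

```python
# Weights from the composite-score formula; they sum to 100.
WEIGHTS = {"GIS": 30, "APD": 15, "DWU": 20, "OVS": 25, "MT": 10}

def cadb_score(subscores: dict[str, float]) -> float:
    """Weighted sum of normalized per-test scores S_i in [0, 1]; result in [0, 100]."""
    return sum(w * subscores[name] for name, w in WEIGHTS.items())
```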


Implementation

Framework: lm-evaluation-harness
Interface: Standard generate_until + loglikelihood only
Logprob extraction: Token probabilities at decision points
Fallback: Parse structured text when logprobs unavailable
Data: 11K samples total (github.com/[repo]/cadb)

Usage:

```shell
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-70b-hf \
  --tasks cadb_gis,cadb_apd,cadb_dwu,cadb_ovs,cadb_mt \
  --device cuda:0
```

Key References

Calibration: Guo et al. (2017) ICML; Kumar et al. (2019) NeurIPS
Verbalized uncertainty: Kadavath et al. (2022) arXiv:2207.05221; Lin et al. (2022) arXiv:2205.14334
Animal metacognition: Smith et al. (2003) BBS 26(3); Hampton (2001) PNAS 98(9); Kornell et al. (2007) Psych Sci 18(1)
Developmental: Flavell (1979) Am Psych 34(10)


License: MIT
