@albertbuchard
Created May 15, 2025 13:58
O3 Answers: Value of Information and Active Learning

Active learning in one line: keep a model $p_\theta(y\mid x,\mathcal D)$ that already knows something ($\mathcal D$), then ask for the single measurement $(x^*,y^*)$ that, once answered, will help most.


1. Value of Information (VoI)

| Concept | Formal core | Intuition for the next query |
|---|---|---|
| Decision-centric | $\text{VoI}(x)=\mathbb E_{y\sim p_\theta(y\mid x,\mathcal D)}\big[\Delta U(\theta\mid x,y)\big]$ | “If I were given the label of $x$, how much would my future utility rise on average?” |
| Information-theoretic | $\text{VoI}(x)=\text{IG}(x)=\mathrm H[\theta\mid\mathcal D]-\mathbb E_{y}\big[\mathrm H[\theta\mid\mathcal D\cup\{(x,y)\}]\big]$ | “How many bits of uncertainty about the model parameters will this label erase?” |
| Predictive (BALD) | $\text{VoI}(x)=\mathrm H[y\mid x,\mathcal D]-\mathbb E_{y}\,\mathbb E_{x'}\big[\mathrm H[y'\mid x',\mathcal D\cup\{(x,y)\}]\big]$ | “How much will overall test-set entropy fall once I know $y$?” |

Key properties

  • Prospective: computed before seeing $y$.
  • Expectation over outcomes: guards against chasing single exotic labels.
  • Optimality: if the utility and full posterior are exact, a myopic “+1 sample” policy that maximises VoI is Bayes-optimal.
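
As a minimal, self-contained sketch (not from the original note), the information-theoretic form of VoI can be computed exactly over a small discrete hypothesis space. The three-hypothesis coin below, including the prior and likelihoods, is invented for illustration:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def information_gain(prior, likelihoods):
    """Expected entropy reduction over hypotheses from one binary label.

    prior: P(theta_k); likelihoods: P(y=1 | theta_k) for each hypothesis.
    """
    p_y1 = sum(p, l) if False else sum(p * l for p, l in zip(prior, likelihoods))
    post1 = [p * l / p_y1 for p, l in zip(prior, likelihoods)]
    p_y0 = 1 - p_y1
    post0 = [p * (1 - l) / p_y0 for p, l in zip(prior, likelihoods)]
    # H[theta | D] - E_y[ H[theta | D u {(x, y)}] ]
    return entropy(prior) - (p_y1 * entropy(post1) + p_y0 * entropy(post0))

# Three hypotheses about a coin's bias, uniform prior.
prior = [1 / 3, 1 / 3, 1 / 3]
ig = information_gain(prior, [0.2, 0.5, 0.8])
```

A query whose label is equally likely under every hypothesis (all likelihoods 0.5) has zero information gain, matching the intuition that an uninformative label erases no uncertainty about $\theta$.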

2. Surprisal (Self-Information)

$$ I(y\mid x,\mathcal D)=-\log p_\theta(y\mid x,\mathcal D) $$

Measures how shocking the realised answer is after you see it. High surprisal tells you this data point violated the model’s expectations.

Usage patterns

  • Diagnostics: flag outliers, distribution shift, model bugs.
  • Exploration heuristics (empirical Bayes, deep ensembles): select $x$ whose expected surprisal is high or whose distribution of possible surprisals is wide (“disagreement”).
  • Stochastic optimism: even if the actual label is not yet known, one may favour queries whose potential surprisal is large because that correlates with larger VoI in many tasks.
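
The definition above is one line of code; the probabilities here are made-up placeholders for $p_\theta(y\mid x,\mathcal D)$ evaluated at the realised label:

```python
import math

def surprisal(p):
    """Self-information -log p(y | x, D) in nats; large when the label was unexpected."""
    return -math.log(p)

confident = surprisal(0.9)    # model expected this label: small shock
shocking = surprisal(0.05)    # model ruled this label out: large shock
```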

3. Relationship inside an active-learning loop

| Stage | Quantity | Role |
|---|---|---|
| Before querying | Expected surprisal $\mathbb E_{y}[I(y\mid x)] = \mathrm H[y\mid x]$ | Equals predictive entropy, one of the simplest VoI proxies (max-entropy sampling). |
| Before querying | Full VoI / information gain | Gold standard but expensive (needs a posterior update for every hypothetical $y$). |
| After querying | Realised surprisal $I(y^*)$ | Empirical pay-off; a large $I$ usually coincides with a larger-than-normal model update. |

Why we still compute VoI instead of just chasing surprisal: Surprisal is retrospective; you may get a huge shock but only on a vanishingly unlikely branch. VoI averages those branches, ensuring each step yields expected benefit.
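
The identity in the first table row, expected surprisal equals predictive entropy, can be checked numerically; the three-class predictive distribution below is hypothetical:

```python
import math

# Hypothetical predictive distribution p(y | x, D) over three classes.
probs = [0.7, 0.2, 0.1]

# E_y[I(y | x)]: average the surprisal of each outcome under the model.
expected_surprisal = sum(p * (-math.log(p)) for p in probs)

# H[y | x]: predictive entropy of the same distribution.
predictive_entropy = -sum(p * math.log(p) for p in probs)
```

The two sums are term-by-term identical, which is why max-entropy sampling and "query where expected surprisal is high" are the same rule.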


4. Choosing the next $x^*$ in practice

  1. Specify utility — classification accuracy, log-likelihood, decision-theoretic loss, etc.

  2. Approximate VoI

    • Bayesian neural nets / GPs: use closed-form mutual information or Monte-Carlo.
    • Ensemble disagreement: $\mathrm H[y]-\mathbb E_m[\mathrm H_m[y]]$ (BALD proxy).
    • Parameter-change norms: expected gradient length, Fisher information.
  3. Complexity control — batch selection, diversity penalties, cost-aware VoI.

  4. Query label, update posterior, repeat.
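
Steps 1–4 can be sketched with uncertainty sampling (the cheapest proxy from step 2) on a toy pool of three Bernoulli "questions". Everything below, including the hidden rates and the Beta pseudo-count model, is invented for illustration:

```python
import math
import random

random.seed(0)
true_p = [0.1, 0.5, 0.9]            # hidden label rates (simulation only)
counts = [[1, 1] for _ in true_p]   # Beta(1, 1) pseudo-counts per candidate

def pred_entropy(a, b):
    """Entropy of the posterior-predictive Bernoulli with p = a / (a + b)."""
    p = a / (a + b)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

for _ in range(30):
    # Steps 1-2: score each candidate by predictive entropy, pick the max.
    i = max(range(len(counts)), key=lambda k: pred_entropy(*counts[k]))
    # Step 4: query the label, update the posterior counts, repeat.
    y = random.random() < true_p[i]
    counts[i][0 if y else 1] += 1
```

In this sketch the loop keeps returning to the ambiguous candidate (true rate 0.5), whose predictive entropy never collapses, while the near-deterministic candidates are quickly resolved and ignored.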


5. Practical heuristics at a glance

| Heuristic (cheap → costly) | VoI link | Typical when |
|---|---|---|
| Uncertainty sampling (max predictive entropy) | Lower bound on VoI | Posterior over $y$ easy, posterior over $\theta$ hard. |
| Query-by-committee / ensembles | Expected pairwise KL ≈ mutual information | Deep models without a tractable Bayesian posterior. |
| Variance reduction / Fisher score | Approximates expected parameter change | Large-scale regression, cost-sensitive tasks. |
| Exact information gain | True VoI | Small domains (Gaussian processes, discrete graphs). |
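
The query-by-committee row can be made concrete with the ensemble-disagreement score $\mathrm H[y]-\mathbb E_m[\mathrm H_m[y]]$ from step 2 of section 4; the committee predictions below are made up:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def bald_score(member_preds):
    """BALD-style proxy: H[mean prediction] - mean H[member predictions]."""
    k = len(member_preds)
    mean = [sum(col) / k for col in zip(*member_preds)]
    return entropy(mean) - sum(entropy(p) for p in member_preds) / k

# Members agree -> no disagreement, score ~0.
agree = bald_score([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])

# Members confidently contradict each other -> high score.
disagree = bald_score([[0.99, 0.01], [0.01, 0.99], [0.5, 0.5]])
```

The score is high only when the ensemble average is uncertain *and* individual members are confident, i.e. the uncertainty lives in the parameters rather than in irreducible label noise.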

Take-home distinctions

  • VoI: ex-ante, expectation, goal-aligned.
  • Surprisal: ex-post, realised shock, good thermometer of how much the last query actually taught you.
  • High expected surprisal = high predictive entropy = one component of VoI; full VoI also weighs how that surprise will propagate to future predictions or decisions.

Use expected VoI to plan your next question; use realised surprisal to judge how much you actually learned.
