@albertbuchard
Created May 15, 2025 13:58
O3 Answers: Value of Information and Active Learning

Active learning in one line: keep a model $p_\theta(y\mid x,\mathcal D)$ that already knows something ($\mathcal D$), then ask for the single measurement $(x^*,y^*)$ that, once answered, will help most.


1. Value of Information (VoI)

| Concept | Formal core | Intuition for the next query |
|---|---|---|
| Decision-centric | $\text{VoI}(x)=\mathbb E_{y\sim p_\theta(y\mid x,\mathcal D)}\big[\Delta U(\theta\mid x,y)\big]$ | “If I were given the label of $x$, how much would my future utility rise on average?” |
| Information-theoretic | $\text{VoI}(x)=\text{IG}(x)=\mathrm H[\theta\mid\mathcal D]-\mathbb E_{y}\big[\mathrm H[\theta\mid\mathcal D\cup\{(x,y)\}]\big]$ | “How many bits of uncertainty about the model parameters will this label erase?” |
| Predictive (BALD) | $\text{VoI}(x)=\mathrm H[y\mid x,\mathcal D]-\mathbb E_{y}\,\mathbb E_{x'}\big[\mathrm H[y'\mid x',\mathcal D\cup\{(x,y)\}]\big]$ | “How much will overall test-set entropy fall once I know $y$?” |

Key properties

  • Prospective: computed before seeing $y$.
  • Expectation over outcomes: guards against chasing single exotic labels.
  • Optimality: if the utility and full posterior are exact, a myopic “+1 sample” policy that maximises VoI is Bayes-optimal.
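
As a minimal, self-contained sketch (not from the original note), the information-theoretic form of VoI can be computed exactly over a small discrete hypothesis space. The three-hypothesis coin below, including the prior and likelihoods, is invented for illustration:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def information_gain(prior, likelihoods):
    """Expected entropy reduction over hypotheses from one binary label.

    prior: P(theta_k); likelihoods: P(y=1 | theta_k) for each hypothesis.
    """
    p_y1 = sum(p, l) if False else sum(p * l for p, l in zip(prior, likelihoods))
    post1 = [p * l / p_y1 for p, l in zip(prior, likelihoods)]
    p_y0 = 1 - p_y1
    post0 = [p * (1 - l) / p_y0 for p, l in zip(prior, likelihoods)]
    # H[theta | D] - E_y[ H[theta | D u {(x, y)}] ]
    return entropy(prior) - (p_y1 * entropy(post1) + p_y0 * entropy(post0))

# Three hypotheses about a coin's bias, uniform prior.
prior = [1 / 3, 1 / 3, 1 / 3]
ig = information_gain(prior, [0.2, 0.5, 0.8])
```

A query whose label is equally likely under every hypothesis (all likelihoods 0.5) has zero information gain, matching the intuition that an uninformative label erases no uncertainty about $\theta$.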

2. Surprisal (Self-Information)

$$ I(y\mid x,\mathcal D)=-\log p_\theta(y\mid x,\mathcal D) $$

Measures how shocking the realised answer is after you see it. High surprisal tells you this data point violated the model’s expectations.

Usage patterns

  • Diagnostics: flag outliers, distribution shift, model bugs.
  • Exploration heuristics (empirical Bayes, deep ensembles): select $x$ whose expected surprisal is high or whose distribution of possible surprisals is wide (“disagreement”).
  • Stochastic optimism: even if the actual label is not yet known, one may favour queries whose potential surprisal is large because that correlates with larger VoI in many tasks.
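
The definition above is one line of code; the probabilities here are made-up placeholders for $p_\theta(y\mid x,\mathcal D)$ evaluated at the realised label:

```python
import math

def surprisal(p):
    """Self-information -log p(y | x, D) in nats; large when the label was unexpected."""
    return -math.log(p)

confident = surprisal(0.9)    # model expected this label: small shock
shocking = surprisal(0.05)    # model ruled this label out: large shock
```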

3. Relationship inside an active-learning loop

| Stage | Quantity | Role |
|---|---|---|
| Before querying | Expected surprisal $\mathbb E_{y}[I(y\mid x)] = \mathrm H[y\mid x]$ | Equals predictive entropy, one of the simplest VoI proxies (max-entropy sampling). |
| Before querying | Full VoI / information gain | Gold standard but expensive (needs a posterior update for every hypothetical $y$). |
| After querying | Realised surprisal $I(y^*)$ | Empirical pay-off; a large $I$ usually coincides with a larger-than-normal model update. |

Why we still compute VoI instead of just chasing surprisal: Surprisal is retrospective; you may get a huge shock but only on a vanishingly unlikely branch. VoI averages those branches, ensuring each step yields expected benefit.
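
The identity in the first table row, expected surprisal equals predictive entropy, can be checked numerically; the three-class predictive distribution below is hypothetical:

```python
import math

# Hypothetical predictive distribution p(y | x, D) over three classes.
probs = [0.7, 0.2, 0.1]

# E_y[I(y | x)]: average the surprisal of each outcome under the model.
expected_surprisal = sum(p * (-math.log(p)) for p in probs)

# H[y | x]: predictive entropy of the same distribution.
predictive_entropy = -sum(p * math.log(p) for p in probs)
```

The two sums are term-by-term identical, which is why max-entropy sampling and "query where expected surprisal is high" are the same rule.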


4. Choosing the next $x^*$ in practice

  1. Specify utility — classification accuracy, log-likelihood, decision-theoretic loss, etc.

  2. Approximate VoI

    • Bayesian neural nets / GPs: use closed-form mutual information or Monte-Carlo.
    • Ensemble disagreement: $\mathrm H[y]-\mathbb E_m[\mathrm H_m[y]]$ (BALD proxy).
    • Parameter-change norms: expected gradient length, Fisher information.
  3. Complexity control — batch selection, diversity penalties, cost-aware VoI.

  4. Query label, update posterior, repeat.
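
Steps 1–4 can be sketched with uncertainty sampling (the cheapest proxy from step 2) on a toy pool of three Bernoulli "questions". Everything below, including the hidden rates and the Beta pseudo-count model, is invented for illustration:

```python
import math
import random

random.seed(0)
true_p = [0.1, 0.5, 0.9]            # hidden label rates (simulation only)
counts = [[1, 1] for _ in true_p]   # Beta(1, 1) pseudo-counts per candidate

def pred_entropy(a, b):
    """Entropy of the posterior-predictive Bernoulli with p = a / (a + b)."""
    p = a / (a + b)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

for _ in range(30):
    # Steps 1-2: score each candidate by predictive entropy, pick the max.
    i = max(range(len(counts)), key=lambda k: pred_entropy(*counts[k]))
    # Step 4: query the label, update the posterior counts, repeat.
    y = random.random() < true_p[i]
    counts[i][0 if y else 1] += 1
```

In this sketch the loop keeps returning to the ambiguous candidate (true rate 0.5), whose predictive entropy never collapses, while the near-deterministic candidates are quickly resolved and ignored.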


5. Practical heuristics at a glance

| Heuristic (cheap → costly) | VoI link | Typical when |
|---|---|---|
| Uncertainty sampling (max predictive entropy) | Lower bound on VoI | Posterior over $y$ easy, posterior over $\theta$ hard. |
| Query-by-committee / ensembles | Expected pairwise KL ≈ mutual information | Deep models without a tractable Bayesian posterior. |
| Variance reduction / Fisher score | Approximates expected parameter change | Large-scale regression, cost-sensitive tasks. |
| Exact information gain | True VoI | Small domains (Gaussian processes, discrete graphs). |
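
The query-by-committee row can be made concrete with the ensemble-disagreement score $\mathrm H[y]-\mathbb E_m[\mathrm H_m[y]]$ from step 2 of section 4; the committee predictions below are made up:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def bald_score(member_preds):
    """BALD-style proxy: H[mean prediction] - mean H[member predictions]."""
    k = len(member_preds)
    mean = [sum(col) / k for col in zip(*member_preds)]
    return entropy(mean) - sum(entropy(p) for p in member_preds) / k

# Members agree -> no disagreement, score ~0.
agree = bald_score([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])

# Members confidently contradict each other -> high score.
disagree = bald_score([[0.99, 0.01], [0.01, 0.99], [0.5, 0.5]])
```

The score is high only when the ensemble average is uncertain *and* individual members are confident, i.e. the uncertainty lives in the parameters rather than in irreducible label noise.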

Take-home distinctions

  • VoI: ex-ante, expectation, goal-aligned.
  • Surprisal: ex-post, realised shock, good thermometer of how much the last query actually taught you.
  • High expected surprisal = high predictive entropy = one component of VoI; full VoI also weighs how that surprise will propagate to future predictions or decisions.

Use expected VoI to plan your next question; use realised surprisal to judge how much you actually learned.
