@maresb
Last active March 19, 2026 00:06

Reweighting likelihood contributions in PyMC

Note: This document was generated by Claude (Opus) with human feedback.

Motivation

In a standard PyMC model, each observed data point contributes equally to the joint log-probability. When you write

pm.Normal("obs", mu=mu, sigma=sigma, observed=data)

PyMC computes elementwise log-probabilities and sums them. There is no built-in hook to scale individual contributions before that reduction.

Power likelihoods generalize the standard likelihood by raising each observation's contribution to a per-observation exponent $w_i > 0$:

$$p(\theta \mid y) \propto p(\theta) \prod_i p(y_i \mid \theta)^{w_i}$$

In log space this is simply a weighted sum:

$$\log p(\theta \mid y) = \text{const} + \log p(\theta) + \sum_i w_i \log p(y_i \mid \theta)$$

When $w_i = 1$ the observation contributes at standard strength. Setting $w_i < 1$ downweights (softens) its influence; $w_i > 1$ upweights (sharpens) it. The common case is downweighting ($w_i \le 1$) to correct for redundancy in the data.

Example: fitting a distribution to autocorrelated or duplicated data

Suppose you observe $n$ samples and want to fit them to a parametric family (e.g. a Normal). The standard likelihood assumes the observations are i.i.d., but in practice your data may contain duplicates, near-duplicates, or autocorrelation from a time series. Naively treating all $n$ points as independent overstates the information content: the posterior concentrates as if you had $n$ independent observations when you really have far fewer effective degrees of freedom.

Per-observation weights let you correct for this. If you can estimate each observation's effective multiplicity (e.g. via autocorrelation analysis, or simply by counting how many times a value appears), you assign $w_i < 1$ to redundant points so that the total weight $\sum w_i$ equals the effective sample size rather than the raw count. The posterior then reflects the true amount of information in the data.

Recipe

PyMC's pm.CustomDist lets you supply a custom logp function, which is all you need to inject per-observation weights.

1. Define the weighted logp

pm.logp(dist, value) returns the elementwise log-probability as a PyTensor expression, so you can multiply by a weight tensor before PyMC reduces it into the joint log-probability.

def weighted_logp(value, weight, mu, sigma):
    dist = pm.Normal.dist(mu=mu, sigma=sigma)
    return weight * pm.logp(dist, value)

The signature is logp(value, *dist_params) where value is the observed data tensor and *dist_params are the positional arguments you pass to CustomDist (after the name).

Because the return value is a PyTensor expression, automatic differentiation works and NUTS can compute gradients through the weighted likelihood.

2. Define the random function

The random function is used by pm.sample_prior_predictive() and pm.sample_posterior_predictive(). It should ignore the weights and sample from the base distribution — weights are an inferential device, not part of the data-generating process.

def weighted_random(weight, mu, sigma, rng=None, size=None):
    # Ignore the weight; draw from the base distribution using the supplied rng.
    return rng.normal(loc=mu, scale=sigma, size=size)

3. Wire it up with CustomDist

import numpy as np
import pymc as pm

weights = np.array([...])  # per-observation, shape matching data
data = np.array([...])

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=2)

    pm.CustomDist(
        "obs",
        weights,       # → `weight` in logp / random
        mu,            # → `mu`     in logp / random
        sigma,         # → `sigma`  in logp / random
        logp=weighted_logp,
        random=weighted_random,
        observed=data,
    )

Positional arguments after the name are forwarded to both logp and random in order.

Multidimensional observations

When observations are multivariate (e.g. 2-D points), the weights are typically per-observation rather than per-component. Use broadcasting to apply a 1-D weight vector across the component axis:

# data shape: (n_obs, n_components)
# weights shape: (n_obs,)

pm.CustomDist(
    "obs",
    weights[:, None],  # broadcast over components
    mu,
    sigma,
    logp=weighted_logp,
    random=weighted_random,
    observed=data,
)
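The broadcasting can be checked with plain NumPy (a sketch with made-up dimensions; `logp_elementwise` stands in for the array that `pm.logp` would produce):

```python
import numpy as np

n_obs, n_components = 4, 2
rng = np.random.default_rng(0)

logp_elementwise = rng.normal(size=(n_obs, n_components))  # stand-in for pm.logp output
weights = np.array([1.0, 0.5, 0.5, 0.25])                  # per-observation weights

# (n_obs, 1) broadcasts against (n_obs, n_components)
weighted = weights[:, None] * logp_elementwise
print(weighted.shape)  # (4, 2)

# Each observation's total contribution is its weight times its summed logp
manual = (weights * logp_elementwise.sum(axis=1)).sum()
assert np.isclose(weighted.sum(), manual)
```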

Any base distribution

The pattern works with any PyMC distribution. Replace pm.Normal.dist with pm.StudentT.dist, pm.Laplace.dist, or any other distribution that supports pm.logp:

def weighted_logp(value, weight, mu, sigma):
    dist = pm.StudentT.dist(nu=6, mu=mu, sigma=sigma)
    return weight * pm.logp(dist, value)

Special case: constant weight (global tempering)

When all weights are the same scalar $w$, the weighted posterior is:

$$\log p(\theta \mid y) = \text{const} + \log p(\theta) + w \sum_i \log p(y_i \mid \theta)$$

This is equivalent to tempering the entire likelihood. Setting $w < 1$ broadens the posterior by reducing the data's overall influence relative to the prior; setting $w > 1$ sharpens it.

The implementation simplifies: there is no need for per-observation weight arrays, and you can use a scalar directly.

temperature = 0.5  # scalar weight applied to all observations

def tempered_logp(value, temperature, mu, sigma):
    dist = pm.Normal.dist(mu=mu, sigma=sigma)
    return temperature * pm.logp(dist, value)

def tempered_random(temperature, mu, sigma, rng=None, size=None):
    # Ignore the temperature; draw from the base distribution using the supplied rng.
    return rng.normal(loc=mu, scale=sigma, size=size)

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=2)

    pm.CustomDist(
        "obs",
        temperature,
        mu,
        sigma,
        logp=tempered_logp,
        random=tempered_random,
        observed=data,
    )

In the constant-weight case, the effect on the posterior is equivalent to observing a fractional number of data points: $w = 0.5$ with $n$ observations yields the same likelihood curvature as $n/2$ observations at full weight. This makes it useful for:

  • Correcting for duplicates / autocorrelation: if your $n$ observations have an effective sample size of $n_{\text{eff}}$, setting $w = n_{\text{eff}} / n$ recovers the appropriate posterior width.
  • Safe Bayes / generalized Bayesian learning: tempering with $w < 1$ guards against model misspecification by preventing the likelihood from overwhelming the prior.
  • Thermodynamic integration: sweeping $w$ from 0 to 1 to estimate the marginal likelihood (model evidence) for Bayesian model comparison.
  • Cold posteriors ($w > 1$): sharpening the posterior beyond what the data alone would justify, sometimes used in deep learning.

Note that because a constant weight does not change the relative contribution of observations, it does not help when different observations have different levels of redundancy — use per-observation weights for that.
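The fractional-data equivalence is easy to see in the conjugate Normal-with-known-$\sigma$ case (illustrative numbers of my own): scaling every logp term by $w$ adds $w \cdot n / \sigma^2$ to the posterior precision, exactly what $n_{\text{eff}} = w \cdot n$ full-weight observations would contribute.

```python
import numpy as np

sigma = 2.0            # known likelihood sd
m0, s0 = 0.0, 10.0     # Normal(m0, s0) prior on mu
n, w = 100, 0.5
ybar = 1.3             # sample mean of the observations

# Tempered update: every logp term is scaled by w
prec = 1 / s0**2 + w * n / sigma**2
mean = (m0 / s0**2 + w * n * ybar / sigma**2) / prec

# Untempered update with n_eff = w*n observations at the same sample mean
n_eff = w * n
prec_eff = 1 / s0**2 + n_eff / sigma**2
mean_eff = (m0 / s0**2 + n_eff * ybar / sigma**2) / prec_eff

assert np.isclose(prec, prec_eff) and np.isclose(mean, mean_eff)
print(mean, 1 / np.sqrt(prec))  # posterior mean and sd
```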

Computing weights from autocorrelated data

When the data come from a stationary time series, the observations are not independent: nearby samples are correlated, so they carry less information than the same number of i.i.d. draws. The effective sample size (ESS) quantifies how many independent observations the correlated series is worth.

Autocorrelation and its role

The autocorrelation function $\rho(k)$ measures the correlation between observations separated by lag $k$:

$$\rho(k) = \frac{\text{Cov}(y_t, y_{t+k})}{\text{Var}(y_t)}$$

For i.i.d. data, $\rho(k) = 0$ for all $k > 0$. For positively autocorrelated data (the common case), $\rho(k) > 0$ for small $k$, meaning consecutive observations are partially redundant.

Why autocorrelation inflates posterior precision

Under an i.i.d. likelihood, the variance of the posterior mean scales as $\sigma^2 / n$. But if the data are autocorrelated, the true variance of the sample mean is:

$$\text{Var}(\bar{y}) = \frac{\sigma^2}{n} \left(1 + 2 \sum_{k=1}^{n-1} \left(1 - \frac{k}{n}\right) \rho(k)\right)$$

The parenthesized factor is $> 1$ when autocorrelation is positive, meaning the sample mean is less precise than the i.i.d. formula suggests. An i.i.d. likelihood ignores this and produces a posterior that is too narrow.
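As a concrete check (an AR(1) example of my own, not from the text): for an AR(1) process the autocorrelation is $\rho(k) = \phi^k$, and the parenthesized inflation factor can be evaluated numerically:

```python
import numpy as np

phi, n = 0.6, 1000
k = np.arange(1, n)
rho = phi**k

# The parenthesized factor from the variance formula above
inflation = 1 + 2 * np.sum((1 - k / n) * rho)
print(inflation)  # ≈ 3.99: the sample mean is ~4x less precise than sigma^2/n suggests

# For large n this approaches (1 + phi) / (1 - phi) = 4 for phi = 0.6
assert inflation > 1
assert np.isclose(inflation, (1 + phi) / (1 - phi), rtol=1e-2)
```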

Effective sample size

For large $n$, the ESS is:

$$n_{\text{eff}} = \frac{n}{1 + 2 \sum_{k=1}^{\infty} \rho(k)}$$

The denominator $\tau = 1 + 2 \sum_{k=1}^{\infty} \rho(k)$ is called the integrated autocorrelation time. It tells you how many raw samples are worth one independent sample: a series with $\tau = 5$ means every 5 consecutive observations carry roughly the information of 1 independent observation.
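For the AR(1) example with $\rho(k) = \phi^k$, the geometric sum gives the closed form $\tau = 1 + 2\phi/(1-\phi) = (1+\phi)/(1-\phi)$, which a truncated numerical sum confirms:

```python
import numpy as np

phi, n = 0.8, 10_000
k = np.arange(1, n)
tau = 1 + 2 * np.sum(phi**k)   # integrated autocorrelation time (truncated sum)

n_eff = n / tau
print(tau, n_eff)  # tau = (1+phi)/(1-phi) = 9, so n_eff ≈ n/9

assert np.isclose(tau, (1 + phi) / (1 - phi))
```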

Estimating ESS in practice

You cannot sum $\rho(k)$ to infinity, and sample autocorrelation estimates become noisy at large lags. Practical estimators truncate the sum.

Step 1: Estimate the autocorrelation function. Compute the sample autocovariance via FFT (fast, $O(n \log n)$) and normalize by the variance:

import numpy as np

def sample_autocorrelation(y: np.ndarray) -> np.ndarray:
    """Estimate the autocorrelation function using FFT."""
    n = len(y)
    y_centered = y - y.mean()
    # Zero-pad to avoid circular correlation artifacts
    fft_result = np.fft.fft(y_centered, n=2 * n)
    acf_raw = np.fft.ifft(fft_result * np.conj(fft_result)).real[:n]
    # Normalize: acf[0] = variance, divide to get correlation
    return acf_raw / acf_raw[0]
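A quick sanity check of the quantity being estimated (a brute-force lag-1 computation on a toy series of my own): a perfectly alternating ±1 series has lag-1 autocorrelation of exactly $-(n-1)/n$.

```python
import numpy as np

y = np.array([1.0, -1.0] * 50)   # perfectly alternating series, mean 0
yc = y - y.mean()
n = len(y)

# Brute-force autocovariance at lags 0 and 1
acov0 = np.sum(yc * yc)
acov1 = np.sum(yc[:-1] * yc[1:])
rho1 = acov1 / acov0

print(rho1)  # -0.99, i.e. -(n-1)/n for n = 100
assert np.isclose(rho1, -(n - 1) / n)
```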

Step 2: Truncate the sum. Raw autocorrelation estimates are noisy at high lags. A common approach (Geyer, 1992) is an initial positive sequence estimator: sum consecutive pairs $\rho(2k-1) + \rho(2k)$, and stop as soon as a pair becomes negative. This exploits the fact that for a reversible Markov chain the pairwise sums are positive and decreasing, so a negative pair signals that noise has taken over; for general stationary series it still serves as a practical truncation heuristic.

def integrated_autocorrelation_time(rho: np.ndarray) -> float:
    """Estimate the integrated autocorrelation time using Geyer's
    initial positive sequence method."""
    n = len(rho)
    tau = rho[0]  # = 1.0 (the lag-0 term)
    for k in range(1, n // 2):
        pair_sum = rho[2 * k - 1] + rho[2 * k]
        if pair_sum < 0:
            break
        tau += 2 * pair_sum
    return tau

Step 3: Compute ESS.

rho = sample_autocorrelation(data)
tau = integrated_autocorrelation_time(rho)
n_eff = len(data) / tau
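Putting the three steps together on simulated data (an AR(1) series of my own with $\phi = 0.8$, so the true $\tau$ is 9; the two helper functions are repeated so the snippet runs standalone):

```python
import numpy as np

def sample_autocorrelation(y):
    """FFT-based autocorrelation estimate (Step 1)."""
    n = len(y)
    yc = y - y.mean()
    f = np.fft.fft(yc, n=2 * n)
    acf_raw = np.fft.ifft(f * np.conj(f)).real[:n]
    return acf_raw / acf_raw[0]

def integrated_autocorrelation_time(rho):
    """Initial-positive-sequence truncation (Step 2)."""
    tau = rho[0]
    for k in range(1, len(rho) // 2):
        pair_sum = rho[2 * k - 1] + rho[2 * k]
        if pair_sum < 0:
            break
        tau += 2 * pair_sum
    return tau

# Simulate an AR(1) series: y_t = phi * y_{t-1} + eps_t
rng = np.random.default_rng(42)
phi, n = 0.8, 20_000
y = np.empty(n)
y[0] = rng.normal()
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()

rho = sample_autocorrelation(y)
tau = integrated_autocorrelation_time(rho)
n_eff = n / tau

print(tau, n_eff)  # tau typically lands near the true value of 9
assert 1 < tau < n
assert n_eff < n
```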

From ESS to weights

Constant weight (global tempering). If autocorrelation is roughly stationary across the series, a single scalar weight suffices:

$$w = \frac{n_{\text{eff}}}{n} = \frac{1}{\tau}$$

This is the simplest approach and is appropriate when the autocorrelation structure is homogeneous.

Per-observation weights. If the autocorrelation varies over the series (e.g. a non-stationary process, or a series with gaps), you can compute local ESS in sliding windows and derive per-observation weights. A simple scheme:

  1. Divide the series into overlapping windows.
  2. Estimate $\tau$ within each window.
  3. Assign $w_i = 1 / \tau_{\text{local}}$ for observations in that window.
  4. Rescale so that $\sum w_i = n_{\text{eff}}$ (the global ESS).
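A minimal sketch of that scheme, with simplifications of my own: non-overlapping windows (the text suggests overlapping ones) and an arbitrary window size; the τ estimator from Step 2 is inlined so the snippet runs standalone.

```python
import numpy as np

def local_tau(y):
    """Integrated autocorrelation time within one window (Step 2's method)."""
    n = len(y)
    yc = y - y.mean()
    f = np.fft.fft(yc, n=2 * n)
    rho = np.fft.ifft(f * np.conj(f)).real[:n]
    rho /= rho[0]
    tau = 1.0
    for k in range(1, n // 2):
        pair_sum = rho[2 * k - 1] + rho[2 * k]
        if pair_sum < 0:
            break
        tau += 2 * pair_sum
    return tau

def windowed_weights(y, window=500):
    """Assign w_i = 1/tau_local within each window, then rescale so the
    total weight equals the global effective sample size."""
    n = len(y)
    w = np.empty(n)
    for start in range(0, n, window):
        block = y[start:start + window]
        w[start:start + len(block)] = 1.0 / local_tau(block)
    n_eff = n / local_tau(y)            # global ESS
    return w * (n_eff / w.sum())

rng = np.random.default_rng(0)
y = rng.normal(size=2000)               # i.i.d. toy series; weights should be near 1
w = windowed_weights(y)

assert w.shape == y.shape
assert np.all(w > 0)
```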

Duplicates as a special case

Exact duplicates are the extreme case of autocorrelation ($\rho = 1$ within a group of copies). If value $y$ appears $k$ times, each copy should get weight $w = 1/k$: the group then contributes a total weight of 1, equivalent to a single independent observation of $y$. This is consistent with the general framework — perfect correlation within a group gives an integrated autocorrelation time of $\tau = k$, so $w = 1/\tau = 1/k$.
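In code, the $1/k$ rule is a one-liner with np.unique (toy data of my own):

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0])

# Count each value's multiplicity and weight every copy by 1/k
_, inverse, counts = np.unique(data, return_inverse=True, return_counts=True)
weights = 1.0 / counts[inverse]

print(weights)  # each copy weighted by 1/multiplicity: 1, 1/2, 1/2, 1/3, 1/3, 1/3
assert np.isclose(weights.sum(), 3.0)  # total weight = number of distinct values
```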
