@albertbuchard
Created May 25, 2025 01:20
Entropy-Balanced IPAW: Efficient implementation of ridge-penalised entropy balancing applied to inverse probability of attrition weights (IPAW), ensuring finite, non-negative weights for longitudinal studies.
import numpy as np
import pandas as pd
import cvxpy as cp
from typing import Sequence, Union


def entropy_balance_ipaw(
    df: pd.DataFrame,
    *,
    baseline_covariates: Union[Sequence[str], None] = None,
    base_weight_col: str = "ipaw_true",
    session_col: str = "session",
    baseline_session: int = 1,
    ridge: float = 1e-3,          # L₂ penalty on imbalance
    out_col: str = "ipaw_ebal",
    solver: str = "ECOS",
) -> pd.DataFrame:
    """
    Ridge-penalised entropy balancing of existing IPAW weights.

    Guarantees finite, non-negative weights and never returns NaN.

    Objective (per session s ≠ baseline):

        minimise    Σ_i KL(w_i || w0_i) + ridge · || Zᵀ w − μ₀ Σ w0 ||²
        subject to  Σ_i w_i = Σ_i w0_i
                    w_i ≥ 1e-8 · mean(w0)   (numeric lower bound)
    """
    if baseline_covariates is None:
        baseline_covariates = ("age", "sex")
    df = df.copy()

    # baseline (session == baseline_session) covariate means
    mu0 = (
        df.loc[df[session_col] == baseline_session, list(baseline_covariates)]
        .mean()
        .to_numpy()
    )

    # index-aligned container, robust to non-default DataFrame indices
    new_w = pd.Series(np.nan, index=df.index, dtype=float)

    for s, g in df.groupby(session_col, sort=True):
        idx = g.index
        w0 = g[base_weight_col].to_numpy()

        # keep baseline weights unchanged
        if s == baseline_session:
            new_w.loc[idx] = w0
            continue

        Z = g[list(baseline_covariates)].to_numpy(float)
        n = len(w0)

        # numeric lower bound prevents under-flow when w0 is very small
        lb = 1e-8 * w0.mean()

        w = cp.Variable(n, nonneg=True)
        imbalance = Z.T @ w - mu0 * w0.sum()
        obj = cp.Minimize(
            cp.sum(cp.rel_entr(w, w0)) + ridge * cp.sum_squares(imbalance)
        )
        constraints = [cp.sum(w) == w0.sum(), w >= lb]
        prob = cp.Problem(obj, constraints)
        prob.solve(solver=solver, verbose=False)

        # ── graceful fall-backs ────────────────────────────────────────
        if prob.status not in ("optimal", "optimal_inaccurate") or w.value is None:
            new_w.loc[idx] = w0
            continue

        # OPTIONAL: rerun with exact equality when the ridge solution
        # already hits the balance constraints up to machine precision
        if imbalance.value is not None and np.linalg.norm(imbalance.value) < 1e-10:
            constraints_eq = [
                cp.sum(w) == w0.sum(),          # keep total mass fixed
                Z.T @ w == mu0 * w0.sum(),      # exact moment balance
                w >= lb,
            ]
            prob_eq = cp.Problem(cp.Minimize(cp.sum(cp.rel_entr(w, w0))), constraints_eq)
            prob_eq.solve(solver=solver, verbose=False)
            if (
                prob_eq.status in ("optimal", "optimal_inaccurate")
                and w.value is not None
            ):
                new_w.loc[idx] = w.value
                continue

        new_w.loc[idx] = w.value

    df[out_col] = new_w
    return df

Entropy-Balanced IPAW: Causal Considerations

This function, entropy_balance_ipaw, refines an existing set of Inverse Probability of Attrition Weights (IPAW) using ridge-penalized entropy balancing. While entropy balancing can improve covariate balance and produce numerically stable weights, its impact on the causal guarantees of IPAW (specifically, blocking backdoor paths between attrition and an outcome) depends crucially on how it's applied.

How IPAW Blocks Backdoor Paths

Standard IPAW aims to create a pseudo-population where confounders are balanced across different levels of "treatment" (i.e., remaining in the study versus attriting). The original weights ($w^{(0)}$, typically base_weight_col in this function) are derived from a model:

$w^{(0)}_{it} = 1 / \text{Pr}(\text{Subject } i \text{ remains at session } t \mid L_i)$ (or a stabilized version)

Here, $L$ represents the complete set of confounders that create backdoor paths between attrition and the outcome of interest. If $L$ is correctly and fully specified in this initial attrition model, and positivity holds, then weighting by $w^{(0)}$ ensures conditional exchangeability: the potential outcomes $Y(*)$ are independent of attrition $R$ given $L$. This blocks the backdoor paths.
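As a toy numeric illustration of the weight formula above (the retention probabilities here are invented, not estimated from any attrition model):

```python
import numpy as np

# Hypothetical retention probabilities Pr(subject i remains | L_i)
p_remain = np.array([0.9, 0.8, 0.5, 0.95, 0.7])

# Unstabilised IPAW: subjects unlikely to remain get up-weighted
w_unstab = 1.0 / p_remain

# Stabilised version: the numerator is the marginal retention
# probability, which pulls the weights toward 1 and reduces variance
p_marginal = p_remain.mean()
w_stab = p_marginal / p_remain
```

The subject with only a 50% retention probability counts twice under the unstabilised weights; stabilisation rescales all weights by the same marginal probability, leaving the pseudo-population balance intact while shrinking weight variance.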

Entropy Balancing and Causal Guarantees

The entropy_balance_ipaw function adjusts $w^{(0)}$ to new weights $w^{(*)}$ by minimizing the KL divergence from $w^{(0)}$ while enforcing (penalized) balance on a specified set of baseline_covariates.
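For intuition: with exact balance constraints, the KL minimisation has the well-known closed form $w^{(*)}_i \propto w^{(0)}_i \exp(\lambda^\top z_i)$, with $\lambda$ found by Newton's method on the dual. A minimal pure-NumPy sketch of that exact-balance special case (not the ridge-penalised problem the function above solves with cvxpy):

```python
import numpy as np

def entropy_balance_tilt(w0, Z, mu_target, n_iter=50):
    # Dual (exponential-tilting) form of exact entropy balancing:
    # w_i ∝ w0_i · exp(λᵀ z_i), with λ chosen so that the weighted mean
    # of each column of Z equals mu_target, and Σw = Σw0 is preserved.
    w0, Z = np.asarray(w0, float), np.asarray(Z, float)
    Zc = Z - mu_target                 # centre: target moments become zero
    lam = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        u = w0 * np.exp(Zc @ lam)
        p = u / u.sum()                # normalised weights
        g = Zc.T @ p                   # residual imbalance (dual gradient)
        H = (Zc * p[:, None]).T @ Zc - np.outer(g, g)
        lam -= np.linalg.solve(H, g)   # Newton step on the dual
    return p * w0.sum()                # rescale to the original total mass

# Balance a single covariate 0..4 to a target mean of 2.5
w = entropy_balance_tilt(np.ones(5), np.arange(5.0).reshape(-1, 1), np.array([2.5]))
```

The cvxpy formulation in the function trades this exact solution for a ridge penalty, which remains well-posed when exact balance is infeasible or would force extreme weights.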

1. When Causal Protection is Preserved ("Guaranteed Safe")

The crucial insight is that if the entropy balancing procedure constrains or models exactly the same set of confounders $L$ (or a strict superset of $L$) that were used to generate the original $w^{(0)}$ weights, the backdoor path blocking property is preserved.

  • Mechanism: If $w^{(0)}$ correctly adjusted for $L$, and the entropy balancing adjustment $c(L_i)$ is purely a function of $L$ (i.e., baseline_covariates is identical to $L$), then the new weights $w^{(*)}_i = w^{(0)}_i \cdot c(L_i)$ still ensure that the potential outcomes $Y(*)$ are independent of attrition $R$ given $L$. The conditional exchangeability, once established by $w^{(0)}$ based on $L$, is not broken by a further re-weighting that is also solely based on $L$.
  • Practical Implication for this Function: To maintain causal guarantees, the baseline_covariates parameter in entropy_balance_ipaw must include all variables that were part of the original confounder set $L$ used to derive base_weight_col.
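One way to operationalise this requirement is a guard that refuses to proceed when baseline_covariates omits any member of the original confounder set. check_covers_confounders is a hypothetical helper, not part of the function above:

```python
def check_covers_confounders(baseline_covariates, confounders_L):
    # Raise if any confounder used to build the original IPAW model
    # is missing from the set being balanced on.
    missing = set(confounders_L) - set(baseline_covariates)
    if missing:
        raise ValueError(
            f"baseline_covariates omits original confounders: {sorted(missing)}"
        )

# A superset of L is fine; a subset should fail loudly.
check_covers_confounders(("age", "sex", "baseline_score"), ("age", "sex"))
```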

2. How Causal Protection Can Be Accidentally Compromised

Bias can be inadvertently re-introduced if care is not taken:

Scenario — why bias can creep back in:

  • Balance only on a subset $L_{\text{sub}} \subset L$: If baseline_covariates is only a subset of the original $L$, the calibration might improve balance on $L_{\text{sub}}$ but worsen it for the remaining confounders in $L \setminus L_{\text{sub}}$. Exchangeability based on the full $L$ is no longer guaranteed.
  • Add balancing constraints on variables $X \not\in L$:
    ✦ If $X$ is not a confounder: No new bias, but potentially increased variance.
    ✦ If $X$ is a hidden confounder (missed by the original $L$): This could be beneficial by effectively upgrading $L$.
    ✦ If aggressive re-weighting occurs for $X$: May violate positivity or inflate variance.
  • Use outcome or post-baseline variables in baseline_covariates: This directly breaks the backdoor block. The weights would now depend on variables affected by attrition or on the causal pathway to the outcome.
  • Original $w^{(0)}$ from a severely misspecified attrition model: If $L$ was incomplete in the first place, $w^{(0)}$ never fully blocked backdoor paths. Entropy balancing cannot fix this underlying omitted-variable bias; it might only redistribute existing bias.
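The variance-inflation risk noted above can be monitored with Kish's effective sample size, which drops sharply when a few observations dominate the weight mass (a generic diagnostic, not part of the function above):

```python
import numpy as np

def effective_sample_size(w):
    # Kish's ESS: (Σw)² / Σw² — shrinks as weights become more extreme
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

# Aggressive re-weighting inflates variance: compare a uniform weight
# vector with a highly skewed one carrying the same total mass
ess_flat = effective_sample_size([1.0, 1.0, 1.0, 1.0])
ess_skew = effective_sample_size([3.7, 0.1, 0.1, 0.1])
```

A large drop in ESS after balancing is a sign that the constraints are forcing extreme weights, and that a stronger ridge penalty (or a smaller constraint set) may be warranted.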

3. Best Practices for Causal Safety with entropy_balance_ipaw

To leverage the benefits of entropy balancing (like improved empirical balance and numerical stability) without compromising causal inference:

  1. Define $L$ Broadly (for original IPAW): Ensure the model generating base_weight_col includes all plausible pre-attrition confounders ($L$).
  2. Diagnose Original Balance: Check the balance achieved by base_weight_col on all covariates in $L$.
  3. Use Full $L$ for Entropy Balancing: When calling entropy_balance_ipaw, set baseline_covariates to be identical to the full set $L$ used for base_weight_col. Do not omit covariates from this set, even if they appeared balanced by $w^{(0)}$, as the optimization could inadvertently disrupt their balance.
  4. Penalize (Ridge): The ridge parameter helps prevent extreme weights, especially with smaller sample sizes or when perfect balance is hard to achieve. This is generally safer than forcing exact balance.
  5. No Outcome/Post-Baseline Variables: Never include outcome data or variables affected by attrition (post-baseline stressors that are not part of $L$) in baseline_covariates.
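The balance diagnostics in step 2 (and a re-check after balancing) can use a standardised-mean-difference statistic. weighted_smd is a hypothetical helper comparing the weighted covariate mean against the baseline target:

```python
import numpy as np

def weighted_smd(x, w, mu0):
    # Difference between the weighted mean of covariate x and the
    # baseline target mu0, standardised by the weighted std. deviation
    x, w = np.asarray(x, float), np.asarray(w, float)
    mean = np.average(x, weights=w)
    sd = np.sqrt(np.average((x - mean) ** 2, weights=w))
    return (mean - mu0) / sd

# Perfectly balanced case: the weighted mean already equals the target
smd_zero = weighted_smd([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 2.0)

# Up-weighting the largest value pulls the mean above the target
smd_pos = weighted_smd([1.0, 2.0, 3.0], [1.0, 1.0, 4.0], 2.0)
```

A common rule of thumb is to flag covariates with an absolute standardised difference above roughly 0.1 as insufficiently balanced.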

Bottom Line for entropy_balance_ipaw

  • Causally Safe Use: When baseline_covariates in this function is set to the complete set of confounders $L$ that defined the original IPAW (base_weight_col), the resulting ipaw_ebal weights preserve the backdoor path blocking properties of the original IPAW. This function then serves to potentially improve model-robustness and variance control for the estimation of causal effects.
  • Risky Use: If baseline_covariates represents only a subset of $L$, or introduces variables inappropriately, the causal protection can be weakened or broken.

This function offers a powerful way to refine weights, but its application in a causal inference context requires careful consideration of the confounder set $L$.
