```python
import numpy as np
import pandas as pd
import cvxpy as cp
from typing import Sequence, Union


def entropy_balance_ipaw(
    df: pd.DataFrame,
    *,
    baseline_covariates: Union[Sequence[str], None] = None,
    base_weight_col: str = "ipaw_true",
    session_col: str = "session",
    baseline_session: int = 1,
    ridge: float = 1e-3,  # L2 penalty on imbalance
    out_col: str = "ipaw_ebal",
    solver: str = "ECOS",
) -> pd.DataFrame:
    """
    Ridge-penalised entropy balancing of existing IPAW weights.
    Guarantees finite, non-negative weights and never returns NaN.

    Objective (per session s != baseline):
        minimise   sum_i KL(w_i || w0_i) + ridge * ||Z'w - mu0 * sum(w0)||^2
        subject to sum_i w_i = sum_i w0_i
                   w_i >= 1e-8 * mean(w0)   (numeric lower bound)
    """
    if baseline_covariates is None:
        baseline_covariates = ("age", "sex")
    # a list indexer works for both tuples and lists of column names
    baseline_covariates = list(baseline_covariates)
    df = df.copy()

    # baseline (session == baseline_session) covariate means
    mu0 = (
        df.loc[df[session_col] == baseline_session, baseline_covariates]
        .mean()
        .to_numpy()
    )

    new_w = np.empty(len(df), dtype=float)
    for s, g in df.groupby(session_col, sort=True):
        # positional indices, so the assignment works even when df does not
        # carry a default RangeIndex
        pos = df.index.get_indexer(g.index)
        w0 = g[base_weight_col].to_numpy()

        # keep baseline weights unchanged
        if s == baseline_session:
            new_w[pos] = w0
            continue

        Z = g[baseline_covariates].to_numpy(float)
        n = len(w0)

        # numeric lower bound prevents under-flow when w0 is very small
        lb = 1e-8 * w0.mean()
        w = cp.Variable(n, nonneg=True)
        imbalance = Z.T @ w - mu0 * w0.sum()
        obj = cp.Minimize(
            cp.sum(cp.rel_entr(w, w0)) + ridge * cp.sum_squares(imbalance)
        )
        constraints = [cp.sum(w) == w0.sum(), w >= lb]
        prob = cp.Problem(obj, constraints)
        prob.solve(solver=solver, verbose=False)

        # graceful fall-back: keep the original weights if the solve failed
        if prob.status not in ("optimal", "optimal_inaccurate") or w.value is None:
            new_w[pos] = w0
            continue
        w_ridge = w.value.copy()  # preserve the ridge solution before any rerun

        # OPTIONAL: rerun with exact moment equality when the ridge solution
        # already hits the constraints up to machine precision
        if imbalance.value is not None and np.linalg.norm(imbalance.value) < 1e-10:
            # add the exact moment constraint; keep the total-mass constraint
            constraints.append(Z.T @ w == mu0 * w0.sum())
            prob_eq = cp.Problem(
                cp.Minimize(cp.sum(cp.rel_entr(w, w0))), constraints
            )
            prob_eq.solve(solver=solver, verbose=False)
            if (
                prob_eq.status in ("optimal", "optimal_inaccurate")
                and w.value is not None
            ):
                new_w[pos] = w.value
                continue

        # fall back to the ridge solution (also reached when the rerun fails)
        new_w[pos] = w_ridge

    df[out_col] = new_w
    return df
```
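To make the expected input concrete, here is a minimal sketch of the long-format frame the function consumes (column names follow the function's defaults; the `subject` column and all values are made up). The actual call requires `cvxpy`, so it is shown commented out:

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per subject-session, baseline covariates
# repeated across sessions, plus an existing IPAW weight column.
rng = np.random.default_rng(42)
n_subjects = 50
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subjects), 2),
    "session": np.tile([1, 2], n_subjects),
    "age": np.repeat(rng.integers(20, 70, n_subjects), 2),
    "sex": np.repeat(rng.integers(0, 2, n_subjects), 2),
    "ipaw_true": 1.0,  # placeholder; real IPAW comes from an attrition model
})

# df = entropy_balance_ipaw(df, baseline_covariates=["age", "sex"])  # needs cvxpy
print(df.shape)
```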
## Entropy-Balanced IPAW: Causal Considerations

This function, `entropy_balance_ipaw`, refines an existing set of Inverse Probability of Attrition Weights (IPAW) using ridge-penalized entropy balancing. While entropy balancing can improve covariate balance and produce numerically stable weights, its impact on the causal guarantees of IPAW (specifically, blocking backdoor paths between attrition and the outcome) depends crucially on how it is applied.
### How IPAW Blocks Backdoor Paths

Standard IPAW aims to create a pseudo-population where confounders are balanced across different levels of "treatment" (i.e., remaining in the study versus attriting). The original weights (`base_weight_col` in this function) are derived from an attrition model:

$$w^{(0)}_i = \frac{1}{P(R_i = 1 \mid L_i)}$$

Here, $R_i$ indicates that subject $i$ remains in the study and $L_i$ is the set of pre-attrition confounders.
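A toy numpy sketch of this mechanism (all numbers simulated, and the attrition probabilities assumed known rather than estimated): weighting the retained subjects by $1/P(R=1\mid L)$ recovers full-sample covariate means that a naive complete-case analysis distorts.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=5000)                        # baseline confounder
p_remain = 1 / (1 + np.exp(-(0.5 + 0.8 * L)))    # P(R = 1 | L), assumed known
R = rng.random(5000) < p_remain                  # who stays in the study

# attrition is selective: stayers have higher L on average
naive_mean = L[R].mean()

# IPAW pseudo-population: each stayer stands in for 1 / P(R = 1 | L) subjects
w0 = 1.0 / p_remain[R]
ipaw_mean = np.average(L[R], weights=w0)

print(naive_mean, ipaw_mean, L.mean())
```

The weighted mean lands close to the full-sample mean, while the naive complete-case mean is visibly biased upward.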
### Entropy Balancing and Causal Guarantees

The `entropy_balance_ipaw` function adjusts the weights within each non-baseline session so that the weighted means of `baseline_covariates` match the baseline-session means, while staying as close as possible (in KL divergence) to the original weights.
### 1. When Causal Protection is Preserved ("Guaranteed Safe")

The crucial insight is that if the entropy balancing procedure constrains or models exactly the same set of confounders $L$ that the original weights adjusted for (or a strict superset of $L$ consisting only of pre-attrition variables), the backdoor paths remain blocked.
- **Mechanism:** If $w^{(0)}$ correctly adjusted for $L$, and the entropy-balancing adjustment $c(L_i)$ is purely a function of $L$ (i.e., `baseline_covariates` is identical to $L$), then the new weights $w^{(1)}_i = w^{(0)}_i \cdot c(L_i)$ still ensure that the potential outcomes $Y$ are independent of attrition $R$ given $L$. The conditional exchangeability, once established by $w^{(0)}$ based on $L$, is not broken by a further re-weighting that is also solely based on $L$.
- **Practical Implication for this Function:** To maintain causal guarantees, the `baseline_covariates` parameter in `entropy_balance_ipaw` must include all variables that were part of the original confounder set $L$ used to derive `base_weight_col`.
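The mechanism above can be checked numerically: multiplying the weights by any function $c(L)$ rescales every weight within a stratum of $L$ by the same constant, so the weighted distribution of $Y$ given $L$ is untouched. A tiny sketch with made-up values:

```python
import numpy as np

L = np.array([0, 0, 0, 1, 1, 1])            # two strata of the confounder
Y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
w0 = np.array([0.5, 1.0, 1.5, 2.0, 1.0, 0.5])
c = np.where(L == 0, 3.0, 0.25)             # arbitrary function of L only
w1 = w0 * c                                 # re-weighted, as in w1 = w0 * c(L)

# within each stratum of L, the weighted mean of Y is unchanged,
# because every weight in the stratum was scaled by the same constant
for stratum in (0, 1):
    m = L == stratum
    print(np.average(Y[m], weights=w0[m]), np.average(Y[m], weights=w1[m]))
```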
### 2. How Causal Protection Can Be Accidentally Compromised

Bias can be inadvertently re-introduced if care is not taken:
| Scenario | Why Bias Can Creep Back In |
|---|---|
| Balance only on a subset of $L$ | If `baseline_covariates` is only a subset of the original confounder set $L$, the optimization can shift weights in ways that disturb the balance of the omitted confounders, re-opening backdoor paths. |
| Add balancing constraints on variables outside $L$ | If such variables are colliders or descendants of attrition, conditioning on them can open new biasing paths; and if aggressive re-weighting occurs for these variables, the balance already achieved on $L$ can be degraded. |
| Use outcome/post-baseline variables in `baseline_covariates` | This directly breaks the backdoor block. The weights would now depend on variables affected by attrition or on the causal pathway to the outcome. |
| Original $w^{(0)}$ is mis-specified | If the attrition model behind `base_weight_col` omitted confounders, entropy balancing on $L$ cannot repair the resulting bias; it only refines balance on the variables it is given. |
### 3. Best Practices for Causal Safety with `entropy_balance_ipaw`

To leverage the benefits of entropy balancing (like improved empirical balance and numerical stability) without compromising causal inference:
- **Define $L$ Broadly (for original IPAW):** Ensure the model generating `base_weight_col` includes all plausible pre-attrition confounders ($L$).
- **Diagnose Original Balance:** Check the balance achieved by `base_weight_col` on all covariates in $L$.
- **Use Full $L$ for Entropy Balancing:** When calling `entropy_balance_ipaw`, set `baseline_covariates` to be identical to the full set $L$ used for `base_weight_col`. Do not omit covariates from this set, even if they appeared balanced by $w^{(0)}$, as the optimization could inadvertently disrupt their balance.
- **Penalize (Ridge):** The `ridge` parameter helps prevent extreme weights, especially with smaller sample sizes or when perfect balance is hard to achieve. This is generally safer than forcing exact balance.
- **No Outcome/Post-Baseline Variables:** Never include outcome data or variables affected by attrition (post-baseline stressors that are not part of $L$) in `baseline_covariates`.
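For intuition about what the solver is doing, here is a hypothetical pure-numpy sketch of the same objective in its dual form (exponential tilting; `entropy_balance_dual` is an illustrative name, not part of the gist). The KL-closest weights that match target means have the form $w_i = w^{(0)}_i \exp(Z_i^\top \lambda)$, with $\lambda$ found by Newton iterations:

```python
import numpy as np

def entropy_balance_dual(Z, w0, mu0, iters=50):
    """Exponentially tilt w0 so weighted covariate means hit mu0 (sketch)."""
    Z = np.asarray(Z, float)
    lam = np.zeros(Z.shape[1])
    w = w0.copy()
    for _ in range(iters):
        w = w0 * np.exp(Z @ lam)
        w = w * (w0.sum() / w.sum())        # keep sum(w) == sum(w0)
        grad = Z.T @ w - mu0 * w0.sum()     # moment imbalance
        H = (Z * w[:, None]).T @ Z          # weighted second-moment matrix
        lam -= np.linalg.solve(H + 1e-8 * np.eye(len(lam)), grad)
    return w

rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 2)) + 0.3         # covariates shifted off target
w0 = np.ones(200)
mu0 = np.zeros(2)                           # target: baseline means of zero
w = entropy_balance_dual(Z, w0, mu0)
print(np.abs(Z.T @ w / w.sum() - mu0).max())  # residual imbalance
```

This illustrates why the resulting correction is "purely a function of $L$": the multiplier applied to each $w^{(0)}_i$ depends only on that row's covariates $Z_i$.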
### Bottom Line for `entropy_balance_ipaw`

- **Causally Safe Use:** When `baseline_covariates` in this function is set to the complete set of confounders $L$ that defined the original IPAW (`base_weight_col`), the resulting `ipaw_ebal` weights preserve the backdoor-path-blocking properties of the original IPAW. This function then serves to potentially improve model robustness and variance control for the estimation of causal effects.
- **Risky Use:** If `baseline_covariates` represents only a subset of $L$, or introduces variables inappropriately, the causal protection can be weakened or broken.
This function offers a powerful way to refine weights, but its application in a causal inference context requires careful consideration of the confounder set $L$.
Dependencies:

```bash
pip install cvxpy ecos
```