Last active
January 21, 2021 20:35
-
-
Save BioSciEconomist/826556977841cbf966f03542b2be0c55 to your computer and use it in GitHub Desktop.
inverse probability of treatment weighted regression (IPTW)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# *----------------------------------------------------------------- | |
# | PROGRAM NAME: iptw demo.py | |
# | DATE: 11/27/19 | |
# | CREATED BY: MATT BOGARD | |
# | PROJECT FILE: | |
# *---------------------------------------------------------------- | |
# | PURPOSE: inverse probablilty of treatment weighted regression (IPTW) | |
# *---------------------------------------------------------------- | |
# illustrating results from IPTW regression as an alternative to propensity score matching in python | |
# attempt to override sci notation in output | |
pd.set_option('display.float_format', lambda x: '%3.f' % x) | |
# Data: lalonde - used by Dehejia and Wahba (1999) to evaluate propensity score matching. | |
# Ref: Dehejia, Rajeev and Sadek Wahba. 1999.`Causal Effects in Non-Experimental Studies: Re-Evaluating | |
# the Evaluation of Training Programs.'' Journal of the American Statistical Association 94 (448): | |
# 1053-1062 | |
# Ref: Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of | |
# Confounding in Observational Studies. Multivariate Behav Res. 2011;46(3):399–424. | |
# doi:10.1080/00273171.2011.568786 | |
# Descriptions: | |
# treat - an indicator variable for treatment status. | |
# age - age in years. | |
# educ - years of schooling. | |
# black - indicator variable for blacks. | |
# hisp- indicator variable for Hispanics. | |
# married - indicator variable for martial status. | |
# nodegr - indicator variable for high school diploma. | |
# re74 - real earnings in 1974. | |
# re75 - real earnings in 1975. | |
# re78 - real earnings in 1978. | |
# Import pandas and numpy | |
import pandas as pd | |
import numpy as np | |
#-------------------------------------- | |
# Loading & examining our data | |
#-------------------------------------- | |
# read data (also available at: https://github.com/BioSciEconomist/econometricsense/blob/master/lalonde.csv | |
df = pd.read_csv('/Users/amandabogard/Google Drive/R Training/lalonde.csv') | |
### inspecting data | |
df.columns | |
df.head() | |
df.tail() | |
df.info() | |
df.shape | |
df.index | |
df.describe() | |
#----------------------------------- | |
# estimate propensity scores and calculate weights | |
#---------------------------------- | |
import statsmodels.api as sm # import stastmodels | |
import statsmodels.formula.api as smf # this allows us to use an explicit formulation | |
model = smf.glm('treat ~ age + educ + black + hispan + married + nodegree + re74 + re75', data=df, family=sm.families.Binomial(link = sm.genmod.families.links.logit)) | |
result = model.fit() | |
result.summary() | |
# add propensity scores | |
df['ps'] = result.fittedvalues | |
# explore | |
df.ps.plot( kind='hist', normed=True, bins=30) | |
df.ps.describe() | |
# assess common support | |
df.boxplot(column='ps', by='treat', rot=90) | |
# calculate propensity score weights for ATT (Austin, 2011) | |
df['wt'] = np.where(df['treat']==1, 1,df.ps/(1-df.ps)) | |
df.wt.describe() # check | |
# create weighted data frame (controls are weighted up by 'wt' to look like treatments) | |
from statsmodels.stats.weightstats import DescrStatsW | |
wdf = DescrStatsW(df, weights=df.wt, ddof=1) | |
#---------------------------------- | |
# balance diagnostics | |
#---------------------------------- | |
#### descriptives unmatched or unweighted data | |
# split data and use numpy to deal with pandas issues elated to using group by with means on integer 0/1 values | |
trt = df[df['treat'].isin([1])] | |
ctrl =df[df['treat'].isin([0])] | |
# descriptives for unmatched or unweighted data | |
np.average(trt.black) | |
np.average(ctrl.black) | |
np.average(trt.re74) | |
np.average(ctrl.re74) | |
np.average(trt.age) | |
np.average(ctrl.age) | |
# descriptives with weighted data | |
np.average(trt.black, weights = trt.wt) | |
np.average(ctrl.black, weights = ctrl.wt) | |
np.average(trt.re74, weights = trt.wt) | |
np.average(ctrl.re74, weights = ctrl.wt) | |
np.average(trt.age, weights = trt.wt) | |
np.average(ctrl.age, weights = ctrl.wt) | |
#---------------------------------------- | |
# estimation of treatment effects | |
#---------------------------------------- | |
### unweighted comparisons | |
# regression using smf | |
lmresults = smf.ols('re78 ~ treat', data=df).fit() | |
# inspect results | |
lmresults.summary() # b_est = -635 | |
### weighted comparisons | |
# weighted differences in outcome | |
np.average(trt.re78, weights = trt.wt) | |
np.average(ctrl.re78, weights = ctrl.wt) | |
# iptw regression | |
iptwResults = smf.wls('re78 ~ treat', data=df, weights = df.wt).fit() | |
iptwResults.summary() # b_est = 1214 | |
# this result if very similar to the result using R Matchit | |
# see: https://gist.github.com/BioSciEconomist/6e8b3fba57d00761527215128bdbf11a | |
So sorry for the way to late response - but for future readers - you are correct. I think I have most often seen weights designed to estimate ATE while I attempted to estimate ATT (weighting controls to look like the treatment group). See the reference in my code to the Austin paper for more details.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
df['wt'] = np.where(df['treat']==1, 1,df.ps/(1-df.ps)) - in this equation for treatment users a weight of one is used, but I do see in lot of places inverse of propensity score is used for the treatment group. Is there a reference for the formula you are using. Thanks