---
title: "CS112 Assignment 3, Fall 2020"
author: "Verina Armanyous"
date: "11/01/2020"
output:
  html_document:
    df_print: paged
  latex_engine: xelatex
  word_document: default
  pdf_document: default
---
```{r setup, include=FALSE}
# Don't change this part of the document
knitr::opts_chunk$set(echo = TRUE, warning=FALSE,
message=FALSE, fig.width=6, fig.align="center")
# load the necessary packages
# If you don't have the following packages installed,
# please install them first. But don't include the installation
# code here because every time you knit this document they'll
# be reinstalled which is not necessary!
library(Matching)
library(MatchIt)
library(cobalt)
library(knitr)
library(janitor)
library(tidyverse)
library(gridExtra)
# we need to set the seed of R's random number generator,
# in order to produce comparable results
set.seed(928)
```
# A few important notes
**Option 1 for submitting your assignment**: *This method is actually preferred. This is an RMarkdown document. Did you know you can open this document in RStudio, edit it by adding your answers and code, and then knit it to a pdf? To submit your answers to this assignment, simply knit this file as a pdf and submit it as a pdf on Forum. All of your code must be included in the resulting pdf file, i.e., don't set echo = FALSE in any of your code chunks. To learn more about RMarkdown, watch the videos from session 1 and session 2 of the CS112B optional class. [This](https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) is also a cheat sheet for using RMarkdown. If you have questions about RMarkdown, please post them on Perusall. Try knitting this document in your RStudio. You should be able to get a pdf file. At any step, you can try knitting the document and recreating the pdf. If you get an error, you might have incomplete code.*
**Option 2 for submitting your assignment**: *If you are not comfortable with RMarkdown, you can also choose the Google Doc version of this assignment, make a copy of it and edit the Google doc (include your code, figures, results, and explanations) and at the end download your Google Doc as a pdf and submit the pdf file.*
**Note**: *Either way (if you use Rmd and knit as pdf OR if you use Google Doc and download as pdf) you should make sure you put your name on top of the document.*
**Note**: *The first time you run this document you may get an error that some packages don't exist. If you don't have the packages listed at the top of this document, install them first and you won't get those errors.*
**Note**: *If you work with others or get help from others on the assignment, please provide the names of your partners at the top of this document. Working together is fine, but all must do and understand their own work and have to mention the collaboration on top of this document.*
**Note**: *Don't change the seed in this document. The seed has already been set to 928 with `set.seed()` at the beginning of this document. Changing the seed to a different number will make your results non-replicable.*
## QUESTION 1: Data Generating Example
The following code creates a toy dataset with a treatment variable, $D$, an outcome variable, $Y$, and other variables $V_1$ to $V_4$.
```{r}
n = 1000
## Generating a random data set here
#Syntax for the normal distribution here is rnorm(sample size, mean, SD)
V1 = rnorm(n, 45, 10)
#getting a binary variable
V2 = sample(c(1,0),
replace = TRUE,
size = n,
prob = c(.4,.6))
V3 = rnorm(n, V1/10, 1)
V4 = rnorm(n, 0, 1)
D = as.numeric(pnorm(rnorm(n, .01*V1 + .8*V2 + 0.3*V3 + V4, 1),
                     .45 + .32 + .3*4.5, 1) > .5)
Y = rnorm(n, .8*D - 0.45*V2 - .4*V3 + 2, .2)
# combining everything in a data frame
df = data.frame(V1, V2, V3, V4, D, Y)
```
#### STEP 1
From the variables $V_1$, $V_2$, $V_3$, and $V_4$, which one(s) are not confounding variable(s) (covariates that cause confounding)? Remember, a rule of thumb (although not a perfect rule) is that a confounder is correlated with both the treatment variable and the outcome variable. Explain!
Examining how D and Y are generated, we see that the variables appearing in both equations are V2 and V3. Since V2 and V3 affect (and are correlated with) both the treatment and the outcome, they are confounders, whereas V1 and V4 are not confounding variables because they appear only in the equation for D and not in the equation for Y.
#### STEP 2
Can you figure out the true treatment effect by looking at the data generating process above?
To compute the treatment effect we could use this formula: Y(1) - Y(0).
- Y(1) = .8(1) - 0.45V2 - .4V3 + 2 and Y(0) = .8(0) - 0.45V2 - .4V3 + 2
- Thus Y(1) - Y(0) = (.8(1) - 0.45V2 - .4V3 + 2) - (.8(0) - 0.45V2 - .4V3 + 2) = 0.8
So, the true treatment effect is 0.8.
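As an optional sanity check of this calculation, we can simulate both potential outcomes directly from the same data-generating process (re-drawing the noise term) and confirm that the average unit-level effect is close to 0.8:
```{r}
# simulate both potential outcomes from the DGP above and average their difference
y1 <- rnorm(n, .8*1 - 0.45*V2 - .4*V3 + 2, .2)  # potential outcome under treatment
y0 <- rnorm(n, .8*0 - 0.45*V2 - .4*V3 + 2, .2)  # potential outcome under control
mean(y1 - y0)  # should be close to 0.8 (exactly 0.8 in expectation)
```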
#### STEP 3
Plot the outcome variable against the treatment variable. Make sure you label your axes. Do you see a trend?
```{r}
# Your code here
reg1 <- lm(Y~D,data=df)
plot(x = df$D,y = df$Y,
xlab = "Treatment",
ylab = "Outcome",
main = "Treatment vs Outcome"
)
abline(reg1, col="red")
```
The fitted regression line shows a positive relationship between treatment and outcome: treated units (D = 1) have somewhat higher values of Y than control units (D = 0), consistent with a positive treatment effect.
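Since $D$ is binary, a box plot of $Y$ by treatment group (an optional, complementary view, not required by the prompt) can make the same comparison easier to read:
```{r}
# box plot of the outcome by treatment status (D is binary)
boxplot(Y ~ D, data = df, xlab = "Treatment (D)", ylab = "Outcome (Y)",
        main = "Outcome by treatment group")
```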
#### STEP 4
Are the variables $V_1$, $V_2$, $V_3$, and $V_4$ balanced across the treatment and control groups? You can use any R function from any package to check this (for instance, you can check the cobalt package). Make sure you check all variables.
**Note**: *This is optional but you can use the `gridExtra` package and its `grid.arrange()` function to put all the 4 graphs in one 2 x 2 graph. Read more about the package and how to use it here: https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html. Set `nrow = 2`.*
```{r}
v1_dist<- bal.plot(df, var.name = "V1", treat = df$D, which = "unadjusted")
v2_dist <- bal.plot(df, var.name = "V2", treat = df$D, which = "unadjusted")
v3_dist <- bal.plot(df, var.name = "V3", treat = df$D, which = "unadjusted")
v4_dist <- bal.plot(df, var.name = "V4", treat = df$D, which = "unadjusted")
grid.arrange(v1_dist, v2_dist, v3_dist, v4_dist, nrow = 2)
```
The variables, particularly the continuous variables V1 and V3, have fairly balanced distributions across the treatment and control groups, whereas the binary variable V2, and to a lesser extent V4, show some imbalance between the two groups.
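To back up the visual impression with numbers, cobalt's `bal.tab()` (the package is already loaded above) reports standardized mean differences for the same covariates; this is just an optional cross-check of the plots.
```{r}
# numeric balance summary for the unadjusted data
bal.tab(D ~ V1 + V2 + V3 + V4, data = df)
```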
#### STEP 5
Write code that would simply calculate the Prima Facie treatment effect in the data above. What's the Prima Facie treatment effect? Note that the treatment variable is binary.
```{r}
# simply the average observed for treatment minus average observed for control
mean(df$Y[df$D ==1]) - mean(df$Y[df$D ==0])
```
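Equivalently (an optional check), regressing $Y$ on $D$ alone reproduces this number, because the OLS coefficient on a binary treatment is exactly the difference in group means.
```{r}
# the coefficient on D equals the simple difference in means above
coef(lm(Y ~ D, data = df))["D"]
```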
#### STEP 6
Explain why the Prima Facie effect is not the true average causal effect from Step 2.
To calculate the prima facie effect, we simply subtract the average observed outcome of the control group from that of the treatment group. This difference ignores the counterfactual outcomes: we never observe the same unit under both treatment and control (that would require two parallel universes). Because the treatment and control groups also differ on the confounders V2 and V3, the simple difference in means is biased and does not equal the true average causal effect of 0.8.
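To see the imbalance that drives this bias, we can compare the average values of the confounders V2 and V3 across the two groups; a small sketch using base R's `aggregate()`:
```{r}
# confounder means by treatment status; these differences bias the prima facie estimate
aggregate(cbind(V2, V3) ~ D, data = df, FUN = mean)
```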
#### STEP 7
We can use matching to create a better balance of the covariates. Use a propensity score model that includes all the variables $V_1$, $V_2$, $V_3$, and $V_4$.
```{r}
# Your code here
glm1 <- glm(D ~ V1 + V2 + V3 + V4, family = binomial, data = df)
rr_1 <- Match(Y =df$Y, Tr = df$D, X = glm1$fitted, M = 1)
```
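Before checking covariate balance, it can help to glance at the overlap of the estimated propensity scores between the two groups; a minimal check using the fitted values from the model above:
```{r}
# distribution of estimated propensity scores by treatment status
summary(glm1$fitted.values[df$D == 1])
summary(glm1$fitted.values[df$D == 0])
```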
#### STEP 8
Check the balance of the covariates. Do you see any improvements after matching?
```{r}
# Your code here
mb1 <- MatchBalance(D ~ V1 + V2 + V3 + V4, data = df, match.out = rr_1, nboots = 10)
```
Although the minimum p-value is the same as before matching, we observe a slight improvement for all variables: each variable's treated and control means became closer, and the standardized mean difference for each variable dropped substantially after matching (e.g., V1's standardized mean difference fell from 46.267 to 10.572).
#### STEP 9
What is the treatment effect after matching? Is this surprising given your answer to Step 2. Is the treatment effect found in this Step closer to the treatment effect in Step 2 than the treatment effect before matching?
```{r}
# Your code here
summary(rr_1)
```
The estimate is 0.81326, which is closer to the true treatment effect than the prima facie effect. Matching improved the balance of the confounders across the treatment and control groups, which better addresses the fundamental problem of causal inference: the estimate now reflects the effect of the treatment itself rather than differences in the observed confounders.
## QUESTION 2: Daughters
Read Section 5 (which is only about 1 page long!) of Iacus, King, Porro (2011), Multivariate Matching Methods That Are Monotonic Imbalance Bounding, JASA, V 106, N. 493, available here: https://gking.harvard.edu/files/gking/files/cem_jasa.pdf. Don't worry about the "CEM" elements. Focus on the "daughters" case study.
Data for this case study is available in "daughters" below.
```{r}
daughters = read.csv("http://bit.ly/daughters_data") %>%
clean_names()
```
#### STEP 1
Before doing any matching, run a regression, with the outcome variable being `nowtot`, the treatment variable being `hasgirls`, and the independent vars mentioned below:
- dems,
- repubs,
- christian,
- age,
- srvlng,
- demvote
Show the regression specification. Use the regression to estimate a treatment effect and confidence interval. Check the balance of this not-matched data set using any method of your choice (balance tables, balance plots, love plots, etc).
```{r}
# computing linear regression
daughters_lm1 <- lm(nowtot ~ hasgirls + dems + repubs + christian + age + srvlng + demvote,
data=daughters)
```
```{r}
summary(daughters_lm1)
```
```{r}
#estimating confidence interval of treatment effect
confint(daughters_lm1)[2, ]
```
- As shown above, the estimated coefficient on 'hasgirls' is -0.4523, meaning that, on average, legislators who have at least one daughter score 0.4523 points lower on the National Organization for Women (NOW) voting index, with a 95% confidence interval of [-4.194060, 3.289525].
```{r}
#Note: I got an error when I ran bal.plot like in Q3, so I used the code Prof.Hadavand suggested on Perusall.
daughters_p1 <- bal.plot(daughters$hasgirls ~ daughters$dems, type = "histogram", treat = daughters$hasgirls, mirror = TRUE)
daughters_p2 <- bal.plot(daughters$hasgirls ~ daughters$repubs, type = "histogram", treat = daughters$hasgirls, mirror = TRUE)
daughters_p3 <- bal.plot(daughters$hasgirls ~ daughters$christian, type = "histogram", treat = daughters$hasgirls, mirror = TRUE)
daughters_p4 <- bal.plot(daughters$hasgirls ~ daughters$age, type = "histogram", treat = daughters$hasgirls, mirror = TRUE)
daughters_p5 <- bal.plot(daughters$hasgirls ~ daughters$srvlng, type = "histogram", treat = daughters$hasgirls, mirror = TRUE)
daughters_p6 <- bal.plot(daughters$hasgirls ~ daughters$demvote, type = "histogram", treat = daughters$hasgirls, mirror = TRUE)
grid.arrange(daughters_p1 , daughters_p2, daughters_p3, daughters_p4, daughters_p5, daughters_p6, nrow = 2)
```
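As an optional, compact numeric complement to the mirrored histograms, a love plot of standardized mean differences from cobalt summarizes pre-matching balance in a single figure (shown here only as an illustration):
```{r}
# standardized mean differences for the unadjusted daughters data
love.plot(hasgirls ~ dems + repubs + christian + age + srvlng + demvote,
          data = daughters, binary = "std", abs = TRUE)
```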
#### STEP 2
Then, do genetic matching. Use the same variables as in the regression above. Make sure to set `estimand = "ATT"`. What's your treatment effect?
**Note**: *For replicability purposes, we need to choose a seed for the `GenMatch()` function. However, setting the seed for `GenMatch()` is different. The usual practice of typing `set.seed(some number)` before the GenMatch line doesn't ensure stochastic stability. To set seeds for `GenMatch()`, you have to run `GenMatch()` including instructions for genoud within `GenMatch()`, e.g.: `GenMatch(Tr, X, unif.seed = 123, int.seed = 92485...)`. You can find info on these `.seed` elements in the documentation for `genoud()`. The special `.seed` elements should only be present in the `GenMatch()` line, not elsewhere (because nothing besides `GenMatch()` runs genoud).*
**Note**: *When you use the `GenMatch()` function, wrap everything inside the following function `invisible(capture.output())`. This will reduce the unnecessary output produced from the GenMatch() function. For instance, you can say: `invisible(capture.output(genout_daughters <- GenMatch(...)))`*
```{r}
# store the covariates
covs <- cbind(daughters$dems, daughters$repubs, daughters$christian, daughters$age, daughters$srvlng, daughters$demvote)
#finding optimal weights for covariates to achieve balance
invisible(capture.output(daughter_genout <- GenMatch(Tr = daughters$hasgirls, X = covs, estimand="ATT", M=1, pop.size=20, max.generations=10, wait.generations=1, unif.seed = 123, int.seed = 92485 )))
#estimating causal effect using weights obtained above
daughters_mout <- Match(Y=daughters$nowtot, Tr=daughters$hasgirls, X=covs, estimand="ATT", Weight.matrix=daughter_genout)
#to examine if balance was actually obtained
mb <- MatchBalance(daughters$hasgirls ~ daughters$dems + daughters$repubs + daughters$christian + daughters$age + daughters$srvlng + daughters$demvote,
match.out=daughters_mout, nboots=500)
summary(daughters_mout)
```
As shown above the estimated treatment effect is 0.57692.
#### STEP 3
Summarize (in 5-15 sentences) the genetic matching procedure and results, including what you matched on, what you balanced on, and what your balance results were. Provide output for MatchBalance() in the body of your submission.
*Procedure summary*
First, I identified the covariates to match on (i.e., dems, repubs, christian, age, srvlng, and demvote). Second, I ran the GenMatch() function with population size of 20 to find the optimal weights for the covariates to achieve balance. Then, I called Match() to estimate the causal effect of 'hasgirls' which turned out to be 0.57692.
*Result summary*
The algorithm matched 312 observations and yielded exact balance on repubs & christian and better balance on the rest of the variables.
- For repubs & christian: after matching, the treated mean became exactly equal to the control mean, with a variance ratio (Tr/Co) of 1, a p-value of 1, and a standardized mean difference of 0, i.e., perfect balance.
- For age, dems, srvlng, and demvote: after matching, the variance ratio between treatment and control moved closer to 1 (compared to before matching), indicating better balance. For srvlng, the t-test p-value dropped to 0.54 from 0.85 before matching; however, the treated and control means became slightly closer. Similarly, dems' p-value dropped by 0.04 but gave a better match, with a variance ratio of 0.99 (as opposed to 0.98 before matching). Overall, genetic matching gave us better balance on the 6 covariates.
#### STEP 4
Is your treatment effect different from the one reported before matching? By how much? If your numbers are different, provide some explanation as to why the two numbers are different. If they're not, provide an explanation why they're not different.
There is a substantial difference (0.57692 - (-0.4523), about 1.03) between the treatment effects obtained from the linear regression and from genetic matching. Notably, the linear regression suggested a negative relationship between having a girl and voting in agreement with NOW, whereas genetic matching estimated a positive relationship.
In regression models we are mainly concerned with minimizing the squared residuals, while in matching we aim to pair control and treatment observations on the covariates so that we approximate a randomized experiment with similar covariate distributions, which leads the algorithm to discard some observations. In other words, matching helps reduce bias due to the covariates, making the result more robust (i.e., more plausibly attributable to the treatment rather than to the covariates) compared to a regression model.
#### STEP 5
Change the parameters in your genetic matching algorithm to improve the balance of the covariates. Consider rerunning with M = 2 or 3 or more. Consider increasing other parameters in the `GenMatch()` function such as the number of generations and population size, caliper, etc. Try 10 different ways but don't report the results of the genetic matching weights or the balance in this document. Only report the treatment effect of your top 3 matches. For instance, run the `Match()` function three times for your top 3 genout objects. Make sure the summary reports the treatment effect estimate, the standard error, and the confidence interval. Do you see a large variation in your treatment effect between your top 3 models?
**Note**: *For replicability purposes, we need to choose a seed for the `GenMatch()` function. However, setting the seed for `GenMatch()` is different. The usual practice of typing `set.seed(123)` before the GenMatch line doesn't ensure stochastic stability. To set seeds for `GenMatch()`, you have to run `GenMatch()` including instructions for genoud within `GenMatch()`, e.g.: `GenMatch(Tr, X, unif.seed = 123, int.seed = 92485...)`. You can find info on these `.seed` elements in the documentation for `genoud()`. The special `.seed` elements should only be present in the `GenMatch()` line, not elsewhere (because nothing besides `GenMatch()` runs genoud).*
**Note**: *When you use the `GenMatch()` function, wrap everything inside the following function `invisible(capture.output())`. This will reduce the unnecessary output produced with the GenMatch() function. For instance, you can say: `invisible(capture.output(genout_daughters <- GenMatch(...)))`*
**Note**: *In the matching assignment, you may find that the Genetic Matching step takes a while, e.g., hours. If you have to reduce pop.size to e.g., 10 or 16 to ensure it stops after only an hour or two, that's fine. Running your computer for an hour or two is a good thing. Running it for a full day or more is unnecessary overkill (and if this is your situation, change hyperparameters like pop.size to reduce run-time). For example, we suggest you modify the pop.size (e.g., you can set it to 5!), max.generations (set it to 2!), and wait.generations (set it to 1!) and that should expedite things.*
**Note**: *Can you set a caliper for one confounding variable, and not others (e.g., only set a caliper for "age")? No and yes. No, strictly speaking you can't. But yes, practically speaking you can, if you set other calipers (for the other confounders) that are so wide as to not induce any constraints. E.g., in GenMatch, and Match, you could set `caliper = c(1e16, 1e16, 0.5, 1e16)` and this would induce a certain meaningful caliper for the third confounder in X, without constraining the other confounders (because 1e16 implies a caliper that is so enormously wide that it does not, in practical terms, serve as a caliper at all).*
```{r}
invisible(capture.output(daughter_genout1 <- GenMatch(Tr = daughters$hasgirls, X = covs, estimand = "ATT", M = 1, pop.size = 30, max.generations = 10, wait.generations = 1, unif.seed = 123, int.seed = 92485)))
#estimating causal effect using weights obtained above
daughters_mout1 <- Match(Y=daughters$nowtot, Tr=daughters$hasgirls, X=covs, estimand="ATT", Weight.matrix=daughter_genout1, M = 1, caliper = 0.25)
#to examine if balance was actually obtained
mb1 <- MatchBalance(daughters$hasgirls ~ daughters$dems + daughters$repubs + daughters$christian + daughters$age + daughters$srvlng + daughters$demvote,
match.out=daughters_mout1, nboots=500)
summary(daughters_mout1)
#extract the treatment effect estimate
estimate1 = daughters_mout1$est
#extract the standard error
standard_error1 = daughters_mout1$se
#to calculate the 95% confidence interval
c(estimate1-1.96*standard_error1, estimate1+1.96*standard_error1)
```
```{r}
invisible(capture.output(daughter_genout2 <- GenMatch(Tr = daughters$hasgirls, X = covs, estimand = "ATT", M = 1, pop.size = 300, max.generations = 300, wait.generations = 4, unif.seed = 123, int.seed = 92485)))
#estimating causal effect using weights obtained above
daughters_mout2 <- Match(Y=daughters$nowtot, Tr=daughters$hasgirls, X=covs, estimand="ATT", Weight.matrix=daughter_genout2, M = 1, caliper = 0.4)
#to examine if balance was actually obtained
mb2 <- MatchBalance(daughters$hasgirls ~ daughters$dems + daughters$repubs + daughters$christian + daughters$age + daughters$srvlng + daughters$demvote,
match.out=daughters_mout2, nboots=500)
summary(daughters_mout2)
#extract the treatment effect estimate
estimate2 = daughters_mout2$est
#extract the standard error
standard_error2 = daughters_mout2$se
#to calculate the 95% confidence interval
c(estimate2-1.96*standard_error2, estimate2+1.96*standard_error2)
```
```{r}
invisible(capture.output(daughter_genout3 <- GenMatch(Tr = daughters$hasgirls, X = covs, estimand = "ATT", M = 1, pop.size = 100, max.generations = 100, wait.generations = 3, unif.seed = 123, int.seed = 92485)))
#estimating causal effect using weights obtained above
daughters_mout3 <- Match(Y=daughters$nowtot, Tr=daughters$hasgirls, X=covs, estimand="ATT", Weight.matrix=daughter_genout3, M = 2)
#to examine if balance was actually obtained
mb3 <- MatchBalance(daughters$hasgirls ~ daughters$dems + daughters$repubs + daughters$christian + daughters$age + daughters$srvlng + daughters$demvote,
match.out=daughters_mout3, nboots=1000)
summary(daughters_mout3)
#extract the treatment effect estimate
estimate3 = daughters_mout3$est
#extract the standard error
standard_error3 = daughters_mout3$se
#to calculate the 95% confidence interval
c(estimate3-1.96*standard_error3, estimate3+1.96*standard_error3)
```
First: This match produced an estimated treatment effect of -0.2027 with a 95% confidence interval of [-1.573689, 1.168284], an SE of 0.69948, and a p-value of 0.77198. However, 238 observations were dropped, so we ended up with only 74 matched observations.
Second: This match produced an estimated treatment effect of 0.081522 with a 95% confidence interval of [-2.409311, 2.572354], an SE of 1.2708, and a p-value of 0.94885. 128 observations were dropped, so we ended up with 184 matched observations.
Third: This match produced an estimated treatment effect of 0.9375 with a 95% confidence interval of [-3.031504, 4.906504], an SE of 2.025, and a p-value of 0.64339. No observations were dropped.
#### STEP 6
Repeat everything you've done for Steps 1-2, including the regression, genetic algorithm, code and estimating the treatment effect EXCEPT this time change the definition of treatment to cover 2 girls or more, and change the definition of control to cover 2 boys or more. Exclude all observations that don't meet these requirements. Be sure to explain (in a sentence or two) what you're doing with your new treatment and control definitions. Do your new definitions change anything?
**Note**: *Definition of the new treatment variable is as follows: Individuals in the treatment group should be having 2 or more girls and no boys, and individuals in the control group should be having 2 or more boys and no girls. What I had in mind was that such a situation increased the "dosage" of treatment vs. the "dosage" of control (and Rosenbaum alluded to this kind of observational design logic in one of the recently assigned articles). Note that you can't have the same units in the treatment group AND the control group -- we should all know by now that such a situation would be wrong.*
```{r}
#filter observations that have two or more girls and no boys
#filter observations that have two or more boys and no girls
#merge both subsets into a new data frame
daughters_modified = rbind(filter(daughters, nboys == 0, ngirls >= 2), filter(daughters, ngirls == 0, nboys >= 2))
#specify the new treatment & control definition
daughters_modified$treatment_modified = ifelse(daughters_modified$ngirls == 0, 0,1)
#show data to ensure proper assignment of the new treatment & control groups
daughters_modified %>%
select(treatment_modified, ngirls, nboys)
```
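A quick tabulation (just a sanity check) shows how many legislators fall into each newly defined group:
```{r}
# counts for the new control (0) and treatment (1) groups
table(daughters_modified$treatment_modified)
```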
```{r}
#new regression model
daughters_modified_lm1 <- lm(nowtot ~ treatment_modified + dems + repubs + christian + age + srvlng + demvote, data=daughters_modified)
summary(daughters_modified_lm1)
#calculate the confidence interval
confint(daughters_modified_lm1)[2, ]
```
The new treatment effect is 12.2925 with a 95% confidence interval of [5.330876, 19.254209].
```{r}
#check balance before matching
daughters_modified_p1 <- bal.plot(daughters_modified$treatment_modified ~ daughters_modified$dems, treat = daughters_modified$treatment_modified)
daughters_modified_p2 <- bal.plot(daughters_modified$treatment_modified ~ daughters_modified$repubs, treat = daughters_modified$treatment_modified)
daughters_modified_p3 <- bal.plot(daughters_modified$treatment_modified ~ daughters_modified$christian, treat = daughters_modified$treatment_modified)
daughters_modified_p4 <- bal.plot(daughters_modified$treatment_modified ~ daughters_modified$age, treat = daughters_modified$treatment_modified)
daughters_modified_p5 <- bal.plot(daughters_modified$treatment_modified ~ daughters_modified$srvlng, treat = daughters_modified$treatment_modified)
daughters_modified_p6 <- bal.plot(daughters_modified$treatment_modified ~ daughters_modified$demvote, treat = daughters_modified$treatment_modified)
grid.arrange(daughters_modified_p1 , daughters_modified_p2, daughters_modified_p3, daughters_modified_p4, daughters_modified_p5, daughters_modified_p6, nrow = 2)
```
```{r}
#combine covariates
daughters_modified_covs <- cbind(daughters_modified$dems, daughters_modified$repubs, daughters_modified$christian, daughters_modified$age, daughters_modified$srvlng, daughters_modified$demvote)
#finding optimal weights for covariates to achieve balance
invisible(capture.output(daughters_modified_genout <- GenMatch(Tr = daughters_modified$treatment_modified, X = daughters_modified_covs , estimand="ATT", M=1, pop.size=20, max.generations=10, wait.generations=1, unif.seed = 123, int.seed = 92485 )))
#estimating causal effect using weights obtained above
daughters_modified_mout <- Match(Y=daughters_modified$nowtot, Tr=daughters_modified$treatment_modified, X=daughters_modified_covs, estimand="ATT", Weight.matrix=daughters_modified_genout)
#to examine if balance was actually obtained
daughters_modified_mb <- MatchBalance(daughters_modified$treatment_modified ~ daughters_modified$dems + daughters_modified$repubs + daughters_modified$christian + daughters_modified$age + daughters_modified$srvlng + daughters_modified$demvote,
match.out=daughters_modified_mout, nboots=500)
summary(daughters_modified_mout)
```
The new estimated treatment effect after matching is 11.383, a substantial difference from the treatment effect obtained under the old definition of treatment (0.57692). This result suggests that having two or more girls and no boys causes a legislator to vote 11.383 points higher in favor of NOW.
#### STEP 7
It is NOT wise to match or balance on "totchi". What is the reason? Hint: You will have to look at what variables mean in the data set to be able to answer this question.
```{r}
summary(daughters$totchi)
```
- "totchi" represents the total number of children (including boys and girls), and thus it wouldn't be accurate to use this variable to match the units since it doesn't address the causal question of the effect of having a girl on voting liberally (rather it would be the effect of having a child on voting behavior). Also, it is not appropriate to balance on it because, for instance, having two people with 5 number of children doesn't mean they are equal, for these 5 children could be any combination of boys and girls. Thus, it wouldn't yield accurate results if we match or balance on "totchi."
## QUESTION 3: COPD
Most causal studies on the health effects of smoking are observational studies (well, for very obvious reasons). In this exercise, we specifically want to answer the following question: Does smoking increase the risk of chronic obstructive pulmonary disease (COPD)? To learn more about the disease, read here: https://www.cdc.gov/copd/index.html
We’ll use a sub-sample of the 2015 BRFSS survey (pronounced bur-fiss), which stands for Behavioral Risk Factor Surveillance System. The data is collected through a phone survey of American citizens regarding their health-related risk behaviors and chronic health conditions. Although the entire survey has over 400,000 records and over 300 variables, we only sample 5,000 observations and 7 variables.
Let's load the data first and take a look at the first few rows:
```{r}
brfss = read.csv("http://bit.ly/BRFSS_data") %>%
clean_names()
head(brfss)
```
A summary of the variables is as follows:
- copd: Ever told you have chronic obstructive pulmonary disease (COPD)?
- smoke: Adults who are current smokers (0 = no, 1 = yes)
- race: Race group
- age: age group
- sex: gender
- wtlbs: weight in pounds (lbs)
- avedrnk2: During the past 30 days, when you drank, how many drinks did you drink on average?
#### STEP 1
Check the balance of covariates before matching using any method of your choice. You can look at balance tables, balance plots, or love plots from any package of your choice. Do you see a balance across the covariates?
**Note**: *This is optional but you can use the `gridExtra` package and its `grid.arrange()` function to put all the 4 graphs in one 2 x 2 graph. Read more about the package and how to use it here: https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html. Set `nrow = 2`.*
```{r}
brfss = splitfactor(brfss, "sex", replace = FALSE, drop.first = TRUE)
brfss_p1 <- bal.plot(brfss , treat = brfss$smoke, var.name = "race", which = "unadjusted")
brfss_p2 <- bal.plot(brfss, treat = brfss$smoke, var.name = "sex_Male", which = "unadjusted")
brfss_p3 <- bal.plot(brfss , treat = brfss$smoke, var.name = "age", which = "unadjusted")
brfss_p4 <- bal.plot(brfss , treat = brfss$smoke, var.name = "wtlbs", which = "unadjusted")
brfss_p5 <- bal.plot(brfss , treat = brfss$smoke, var.name = "avedrnk2", which = "unadjusted")
grid.arrange(brfss_p1, brfss_p2, brfss_p3, brfss_p4, brfss_p5, nrow = 2)
```
The covariates are somewhat balanced across treatment and control, except for a notably higher share of people aged 65+ in the control group compared to the treatment group. Similarly, the proportion of people aged 25-34 is higher in the treatment group (almost double that of the control group).
#### STEP 2
Now, let's do Mahalanobis distance matching. Note that you can use the same old `Match()` function. Use all covariates in the data set to match on, however, make sure you convert all categorical variables into factor variables (Google to see how). We are going to specify `estimand = "ATT"` in the `Match()` function. What's the treatment effect after matching?
```{r}
#This is the resource where I got the code for converting from categorical to factor (https://www.listendata.com/2015/05/converting-multiple-numeric-variables.html)
brfss[sapply(brfss, is.character)] <- lapply(brfss[sapply(brfss, is.character)], as.factor)
# combine the covariates
brfss_covs = cbind(brfss$race, brfss$age, brfss$sex, brfss$wtlbs, brfss$avedrnk2)
#match
brfss_mahal.out <- Match(Y = brfss$copd, Tr= brfss$smoke, X=brfss_covs, M=1, estimand = "ATT", Weight=2)
#check the balance
brfss_mb <- MatchBalance(brfss$smoke ~ brfss$race + brfss$age + brfss$sex + brfss$wtlbs + brfss$avedrnk2, match.out = brfss_mahal.out, data = brfss, nboots = 1000)
summary(brfss_mahal.out)
```
The treatment effect after matching is 0.087135.
#### STEP 3
Provide a few sentences talking about the number of treated units dropped, and a few more sentences talking about the balance obtained.
The original number of treated observations is 762, and the matched number of treated observations is also 762, so no treated unit was dropped: with `estimand = "ATT"`, the matching procedure finds a match for every treated unit. The balance improved slightly, as seen in the reduced standardized mean differences of the covariates. Overall, however, the improvement was modest, since the minimum p-value only changed from 2.22e-16 to 1.2562e-08.
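To verify the claim about dropped units (optional), we can compare the number of distinct treated units retained in the matched sample with the total number of smokers:
```{r}
# distinct treated units that appear in the match vs. total treated units
length(unique(brfss_mahal.out$index.treated))
sum(brfss$smoke == 1)
```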
#### STEP 4
Now, let's do another Mahalanobis distance matching. Use all covariates in the data set in the propensity score estimation. However, this time make sure you specify `estimand = "ATE"` in the `Match()` function. What's the treatment effect after matching?
```{r}
brfss_mahal.out2 <- Match(Y = brfss$copd, Tr= brfss$smoke, X=brfss_covs, M=1, estimand = "ATE", Weight=2)
#check the balance
brfss_mb2<- MatchBalance(brfss$smoke ~ brfss$race + brfss$age + brfss$sex + brfss$wtlbs + brfss$avedrnk2, match.out = brfss_mahal.out2, data = brfss, nboots = 1000)
summary(brfss_mahal.out2)
```
The treatment effect is 0.1316.
#### STEP 5
Are your answers in Steps 2 and 4 different? Why? What does the matching process do differently in each step? Which answer do you trust more?
Yes, they are different. With `estimand = "ATT"`, the algorithm matched the 762 treated units and discarded any control units that were not needed as matches, because it estimates the average treatment effect on the treated. With `estimand = "ATE"`, all 5,000 units were matched, giving the treatment effect across all units, not just the treated ones. We are usually most interested in the effect on the units that actually received the treatment, so the ATT estimate is the more reliable one here. In addition, under ATE the minimum p-value did not change at all (2.22e-16 both before and after matching), whereas under ATT it increased slightly to 1.2562e-08.
## BONUS QUESTION: Sensitivity Analysis
#### STEP 1
Use the BRFSS data set from the previous question. Now, identify the critical value of Gamma as we discussed in the class. Do it using rbounds: https://nbviewer.jupyter.org/gist/viniciusmss/a156c3f22081fb5c690cdd58658f61fa
```{r}
# Your code here
library(rbounds)
psens(brfss_mahal.out, Gamma=1.5, GammaInc=.1)
```
#### STEP 2
Then, write a paragraph explaining what you found. Your paragraph should include numbers obtained from your analysis and explain what those numbers mean as simply as you can.
# End of Assignment
## Final Steps
Before finalizing your project you'll want to be sure there are **comments in your code chunks** and **text outside of your code chunks** to explain what you're doing in each code chunk. These explanations are incredibly helpful for someone who doesn't code or someone unfamiliar with your project.
You have two options for submission:
1. You can complete this .rmd file, knit it to pdf and submit the resulting .pdf file on Forum.
2. You can complete the Google Doc version of this assignment, include your code, graphs, results, and your explanations wherever necessary, and download the Google Doc as a pdf file and submit the pdf file on Forum. If you choose this method, you need to make sure you provide a link to an .R script file where your code can be found (you can host your code on GitHub or Google Drive). Note that links to Google Docs are not accepted as your final submission.
### Knitting your R Markdown Document
Last but not least, you'll want to **Knit your .Rmd document into a pdf document**. If you get an error, take a look at what the error says and edit your .Rmd document. Then, try to Knit again! Troubleshooting these error messages will teach you a lot about coding in R. If you get any error that doesn't make sense to you, post it on Perusall.
Good Luck! The CS112 Teaching Team