@thomvolker
Last active August 7, 2024 13:51

mice.impute.pmm() can give counterintuitive results when using type 1 matching, as the following scenario shows.

The following code was provided by Stef van Buuren and is inspired by Templ (2023), Visualization and Imputation of Missing Values.

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
data(ethanol, package = "lattice")
ethanol <- ethanol[,c(1,3)]
ethanol_na <- ethanol
set.seed(12)
ethanol_na$NOx[sample(1:nrow(ethanol_na), 10)] <- NA

imp <- mice(ethanol_na, m = 1, method = "pmm")
#> 
#>  iter imp variable
#>   1   1  NOx
#>   2   1  NOx
#>   3   1  NOx
#>   4   1  NOx
#>   5   1  NOx
cda <- complete(imp)
plot(NOx ~ E, data = ethanol_na, col = mdc(1), pch = 20, cex = 1.5)
w <- is.na(ethanol_na$NOx)
points(cda$E[w], cda$NOx[w], col = mdc(2), pch = 20, cex = 1.5)
points(ethanol$E[w], ethanol$NOx[w], col = mdc(3), pch = 20, cex = 1.5)
segments(x0 = ethanol$E[w], x1 = ethanol$E[w],
         y0 = ethanol$NOx[w], y1 = cda$NOx[w], col = mdc(3), lty = 2)
mtext("PMM Puzzle")

# Colors: Blue = observed, Grey = masked, Red = imputed

The following code is my own attempt to gain some insight into why this counterintuitive finding occurred.

## PMM Puzzle solution

pred_x <- matrix(seq(0.6, 1.2, length.out = 100), dimnames = list(NULL, "E"))

# Coefficients of the regression of NOx on E as used inside pmm:
# mlb is the maximum likelihood (least squares) estimate, draw is a posterior draw
mlb <- c(2.5568626, -0.6440363)
draw <- c(2.3035742, -0.4593925)

pred_y_ml <- cbind(1, pred_x) %*% mlb
pred_y_draw <- cbind(1, pred_x) %*% draw

o_ml <- order(pred_y_ml)
o_draw <- order(pred_y_draw)

lines(pred_x[o_ml, ], pred_y_ml[o_ml], col = mdc(1), lwd=2)
lines(pred_x[o_draw, ], pred_y_draw[o_draw], col = mdc(2), lwd=2)

# Focus on the missing case with the second-smallest E value
targetid <- which(w)[order(ethanol_na$E[w])[2]]
target <- ethanol_na[targetid, , drop = FALSE] |> as.matrix()

points(target[,2][[1]], ethanol[targetid,1], col="orange", pch = 17, cex=1.5)
arrows(target[,2][[1]]+0.05, ethanol[targetid, 1] - 0.3,
       target[,2][[1]], ethanol[targetid, 1], length = 0.1)

# Predict the target from the posterior draw, the donors from the ML estimate (type 1 matching)
target_y <- cbind(1, target[, -1, drop = FALSE]) %*% draw
donors <- (cbind(1, ethanol[!w, -1]) |> as.matrix()) %*% mlb

# matchindex() randomly picks one of the closest donors (five by default);
# repeating the call recovers the set of candidate donors
matches <- sapply(1:100, \(i) matchindex(donors, target_y)) |> unique()
points(ethanol[!w,][matches,c("E", "NOx")],
       col = "salmon", pch=18, cex=1.5)
lines(matrix(c(cda[targetid,c("E", "NOx")],
               ethanol[!w,][matches[5], c("E", "NOx")]),
             nrow=2, byrow=T))

# Calculate the target's predicted value under the drawn coefficients and find
# where the ML (blue) line attains that same value: that is the region in which
# type 1 matching looks for donors
ypred <- t(c(1, ethanol[targetid, 2])) %*% draw
x_mlb <- (ypred - mlb[1]) / mlb[2]
y_mlb <- t(c(1, x_mlb)) %*% mlb

arrows(ethanol[targetid, 2], ypred,
       x_mlb, y_mlb)

legend(
  "topright", 
  c("Observed", "Imputed", "Missing", "Target", "Donors"),
  col = c(mdc(1:3), "orange", "salmon"),
  pch = c(16, 16, 16, 17, 18)
)

Notably, because the donors' predicted values come from the maximum likelihood estimates (the blue line) while the target's predicted value comes from the posterior draw (the red line), the matching effectively searches in a different subregion of the data: the donors are now found around the point where the longer black arrow touches the blue line.

Created on 2024-07-29 with reprex v2.0.2
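
To make the role of the matching type concrete, here is a small sketch (not part of the original reprex) that computes the three distances supported by the matchtype argument of mice.impute.pmm(), reusing mlb, draw, w, and targetid from the code above; X_don and X_tar are names I introduce for illustration.

# Sketch: the three matching types of mice.impute.pmm(), using the objects above
X_don <- cbind(1, ethanol$E[!w])        # design matrix of the observed (donor) cases
X_tar <- c(1, ethanol_na$E[targetid])   # design vector of the target case

# matchtype = 0: target and donors both predicted from the ML estimate
# matchtype = 1 (default): target from the posterior draw, donors from the ML estimate
# matchtype = 2: target and donors both predicted from the posterior draw
idx0 <- matchindex(d = X_don %*% mlb,  t = sum(X_tar * mlb))
idx1 <- matchindex(d = X_don %*% mlb,  t = sum(X_tar * draw))
idx2 <- matchindex(d = X_don %*% draw, t = sum(X_tar * draw))
ethanol[!w, ][c(idx0, idx1, idx2), ]

Only under type 1 do the target and the donors live on different lines (red versus blue), which is what shifts the donor search to a different part of the E range.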

@stefvanbuuren

CONGRATULATIONS! You cracked it. Your prize: DEEP INSIGHT into PMM!

@stefvanbuuren

Any ideas on how to evade the odd behavior?

@stefvanbuuren

stefvanbuuren commented Jul 30, 2024

Should the colors of the two lines be reversed?

I now see they are OK.

@thomvolker (Author)

I'm not sure whether this is behavior we should try to evade, @stefvanbuuren.

  1. As per your handbook section on pmm(), introducing this type 1 matching is necessary to appropriately account for uncertainty about the true value (I did not dive into the details; I just assumed this is correct). A quick way to check this empirically is sketched right after this list.
  2. pmm() is more or less a linear method with some nifty adaptations that make it more robust against model misspecification, but it remains sensitive to the linearity assumption, although the assumption is perhaps better described as approximate linearity or monotonicity.
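
As a quick empirical check of point 1 (a sketch only, run on the ethanol_na data from the gist above), one could rerun the imputation while passing matchtype down to mice.impute.pmm() through the ... argument of mice() and compare the imputed values:

# Sketch: compare type 0 and type 1 (default) matching on the same data
imp_t0 <- mice(ethanol_na, m = 1, method = "pmm", matchtype = 0L,
               seed = 12, printFlag = FALSE)
imp_t1 <- mice(ethanol_na, m = 1, method = "pmm", matchtype = 1L,
               seed = 12, printFlag = FALSE)
cbind(type0 = complete(imp_t0)$NOx[w], type1 = complete(imp_t1)$NOx[w])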

When imposing a model that does not fit the data very well, it is perhaps even desirable that we introduce this larger uncertainty, because the model has barely learned anything from the data; for all it knows, the data just present random noise around some intercept.

Do we then need to (a) adjust pmm() to make it more flexible (i.e., let it adapt to different functional forms), or (b) use another imputation method that is fully non-parametric (and thus allows for any functional form), although these have downsides as well (potential overfitting, perhaps slower)?
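
Both directions can at least be sketched with mice itself; the following is just an illustration on the ethanol_na data from the gist above, and ethanol_flex / E2 are names I made up here. Option (a) hands pmm() a slightly more flexible model by adding a centered quadratic term; option (b) switches to the tree-based mice.impute.cart(), which imposes no functional form.

# Option (a): give pmm() a hand-made nonlinear term (centered to limit collinearity)
ethanol_flex <- ethanol_na
ethanol_flex$E2 <- (ethanol_flex$E - mean(ethanol_flex$E))^2
imp_a <- mice(ethanol_flex, m = 1, method = "pmm", seed = 12, printFlag = FALSE)

# Option (b): tree-based imputation, no linearity assumption
imp_b <- mice(ethanol_na, m = 1, method = "cart", seed = 12, printFlag = FALSE)

Neither comes for free: the quadratic term is still a parametric guess, and a single tree per imputation can overfit or produce rather coarse donor pools.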

@stefvanbuuren

Yes, you're right. We shouldn't be diverted by an edge case and should continue emphasizing the use of reasonable imputation models.
