Skip to content

Instantly share code, notes, and snippets.

@thomvolker
Last active August 7, 2024 13:51
Show Gist options
  • Save thomvolker/323ce169c5c72f799a89268f68df90b9 to your computer and use it in GitHub Desktop.
Save thomvolker/323ce169c5c72f799a89268f68df90b9 to your computer and use it in GitHub Desktop.

mice.impute.pmm() can give counterintuitive results when using type 1 matching, see for example the following scenario.

The following code is provided by Stef van Buuren, inspired by Templ 2023, Visualization and Imputation of Missing Values.

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
data(ethanol, package = "lattice")
ethanol <- ethanol[,c(1,3)]
ethanol_na <- ethanol
set.seed(12)
ethanol_na$NOx[sample(1:nrow(ethanol_na), 10)] <- NA

imp <- mice(ethanol_na, m = 1, method = "pmm")
#> 
#>  iter imp variable
#>   1   1  NOx
#>   2   1  NOx
#>   3   1  NOx
#>   4   1  NOx
#>   5   1  NOx
cda <- complete(imp)
plot(NOx ~ E, data = ethanol_na, col = mdc(1), pch = 20, cex = 1.5)
w <- is.na(ethanol_na$NOx)
points(cda$E[w], cda$NOx[w], col = mdc(2), pch = 20, cex = 1.5)
points(ethanol$E[w], ethanol$NOx[w], col = mdc(3), pch = 20, cex = 1.5)
segments(x0 = ethanol$E[w], x1 = ethanol$E[w],
         y0 = ethanol$NOx[w], y1 = cda$NOx[w], col = mdc(3), lty = 2)
mtext("PMM Puzzle")

# Colors: Blue = observed, Grey = masked, Red = imputed

The following code is my own attempt to gain some insight into why this counter-intuitive finding occured.

## PMM Puzzle solution

pred_x <- matrix(seq(0.6, 1.2, length.out = 100), dimnames = list(NULL, "E"))

mlb <- c(2.5568626, -0.6440363)
draw <- c(2.3035742, -0.4593925)

pred_y_ml <- cbind(1, pred_x) %*% mlb
pred_y_draw <- cbind(1, pred_x) %*% draw

o_ml <- order(pred_y_ml)
o_draw <- order(pred_y_draw)

lines(pred_x[o_ml, ], pred_y_ml[o_ml], col = mdc(1), lwd=2)
lines(pred_x[o_draw, ], pred_y_draw[o_draw], col = mdc(2), lwd=2)

targetid <- which(w)[order(ethanol_na$E[w])[2]]
target <- ethanol_na[targetid, ,drop=F] |> as.matrix()

points(target[,2][[1]], ethanol[targetid,1], col="orange", pch = 17, cex=1.5)
arrows(target[,2][[1]]+0.05, ethanol[targetid, 1] - 0.3,
       target[,2][[1]], ethanol[targetid, 1], length = 0.1)

target_y <- cbind(1, target[,-1,drop=F]) %*% draw
donors <- cbind(1, ethanol[!w, -1]) |> as.matrix() %*% mlb

matches <- sapply(1:100, \(i) matchindex(donors, target_y)) |> unique()
points(ethanol[!w,][matches,c("E", "NOx")],
       col = "salmon", pch=18, cex=1.5)
lines(matrix(c(cda[targetid,c("E", "NOx")],
               ethanol[!w,][matches[5], c("E", "NOx")]),
             nrow=2, byrow=T))

# Calculate target y_pred
ypred <- t(c(1, ethanol[targetid,2])) %*% draw
x_mlb <- (ypred - mlb[1]) / mlb[2]
y_mlb <- t(c(1, x_mlb)) %*% mlb

arrows(ethanol[targetid, 2], ypred,
       x_mlb, y_mlb)

legend(
  "topright", 
  c("Observed", "Imputed", "Missing", "Target", "Donors"),
  col = c(mdc(1:3), "orange", "salmon"),
  pch = c(16, 16, 16, 17, 18)
)

Notably, by using the maximum likelihood predictions (the blue line) and matching these with predictions based on the posterior draw for the missing cases, we basically search in a different subregion. Namely, we now search around the place where the larger black arrow touches the blue line.

Created on 2024-07-29 with reprex v2.0.2

@stefvanbuuren
Copy link

Yes, you're right. We shouldn't be diverted by an edge case and should continue emphasizing the use of reasonable imputation models.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment