mice-pmm-puzzle.md

mice.impute.pmm() can give counterintuitive results when using type 1 matching, see for example the following scenario.

The following code is provided by Stef van Buuren, inspired by Templ 2023, Visualization and Imputation of Missing Values.

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
data(ethanol, package = "lattice")
ethanol <- ethanol[,c(1,3)]
ethanol_na <- ethanol
set.seed(12)
ethanol_na$NOx[sample(1:nrow(ethanol_na), 10)] <- NA

imp <- mice(ethanol_na, m = 1, method = "pmm")
#> 
#>  iter imp variable
#>   1   1  NOx
#>   2   1  NOx
#>   3   1  NOx
#>   4   1  NOx
#>   5   1  NOx
cda <- complete(imp)
plot(NOx ~ E, data = ethanol_na, col = mdc(1), pch = 20, cex = 1.5)
w <- is.na(ethanol_na$NOx)
points(cda$E[w], cda$NOx[w], col = mdc(2), pch = 20, cex = 1.5)
points(ethanol$E[w], ethanol$NOx[w], col = mdc(3), pch = 20, cex = 1.5)
segments(x0 = ethanol$E[w], x1 = ethanol$E[w],
         y0 = ethanol$NOx[w], y1 = cda$NOx[w], col = mdc(3), lty = 2)
mtext("PMM Puzzle")

# Colors: Blue = observed, Grey = masked, Red = imputed

The following code is my own attempt to gain some insight into why this counter-intuitive finding occured.

## PMM Puzzle solution

pred_x <- matrix(seq(0.6, 1.2, length.out = 100), dimnames = list(NULL, "E"))

mlb <- c(2.5568626, -0.6440363)
draw <- c(2.3035742, -0.4593925)

pred_y_ml <- cbind(1, pred_x) %*% mlb
pred_y_draw <- cbind(1, pred_x) %*% draw

o_ml <- order(pred_y_ml)
o_draw <- order(pred_y_draw)

lines(pred_x[o_ml, ], pred_y_ml[o_ml], col = mdc(1), lwd=2)
lines(pred_x[o_draw, ], pred_y_draw[o_draw], col = mdc(2), lwd=2)

targetid <- which(w)[order(ethanol_na$E[w])[2]]
target <- ethanol_na[targetid, ,drop=F] |> as.matrix()

points(target[,2][[1]], ethanol[targetid,1], col="orange", pch = 17, cex=1.5)
arrows(target[,2][[1]]+0.05, ethanol[targetid, 1] - 0.3,
       target[,2][[1]], ethanol[targetid, 1], length = 0.1)

target_y <- cbind(1, target[,-1,drop=F]) %*% draw
donors <- cbind(1, ethanol[!w, -1]) |> as.matrix() %*% mlb

matches <- sapply(1:100, \(i) matchindex(donors, target_y)) |> unique()
points(ethanol[!w,][matches,c("E", "NOx")],
       col = "salmon", pch=18, cex=1.5)
lines(matrix(c(cda[targetid,c("E", "NOx")],
               ethanol[!w,][matches[5], c("E", "NOx")]),
             nrow=2, byrow=T))

# Calculate target y_pred
ypred <- t(c(1, ethanol[targetid,2])) %*% draw
x_mlb <- (ypred - mlb[1]) / mlb[2]
y_mlb <- t(c(1, x_mlb)) %*% mlb

arrows(ethanol[targetid, 2], ypred,
       x_mlb, y_mlb)

legend(
  "topright", 
  c("Observed", "Imputed", "Missing", "Target", "Donors"),
  col = c(mdc(1:3), "orange", "salmon"),
  pch = c(16, 16, 16, 17, 18)
)

Notably, by using the maximum likelihood predictions (the blue line) and matching these with predictions based on the posterior draw for the missing cases, we basically search in a different subregion. Namely, we now search around the place where the larger black arrow touches the blue line.

^{Created on 2024-07-29 with reprex v2.0.2}

thomvolker/mice-pmm-puzzle.md

stefvanbuuren commented Aug 7, 2024

Uh oh!