mice-pmm-puzzle.md

mice.impute.pmm() can give counterintuitive results when using type 1 matching, see for example the following scenario.

The following code is provided by Stef van Buuren, inspired by Templ 2023, Visualization and Imputation of Missing Values.

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
data(ethanol, package = "lattice")
ethanol <- ethanol[,c(1,3)]
ethanol_na <- ethanol
set.seed(12)
ethanol_na$NOx[sample(1:nrow(ethanol_na), 10)] <- NA

imp <- mice(ethanol_na, m = 1, method = "pmm")
#> 
#>  iter imp variable
#>   1   1  NOx
#>   2   1  NOx
#>   3   1  NOx
#>   4   1  NOx
#>   5   1  NOx
cda <- complete(imp)
plot(NOx ~ E, data = ethanol_na, col = mdc(1), pch = 20, cex = 1.5)
w <- is.na(ethanol_na$NOx)
points(cda$E[w], cda$NOx[w], col = mdc(2), pch = 20, cex = 1.5)
points(ethanol$E[w], ethanol$NOx[w], col = mdc(3), pch = 20, cex = 1.5)
segments(x0 = ethanol$E[w], x1 = ethanol$E[w],
         y0 = ethanol$NOx[w], y1 = cda$NOx[w], col = mdc(3), lty = 2)
mtext("PMM Puzzle")

# Colors: Blue = observed, Grey = masked, Red = imputed

The following code is my own attempt to gain some insight into why this counter-intuitive finding occured.

## PMM Puzzle solution

pred_x <- matrix(seq(0.6, 1.2, length.out = 100), dimnames = list(NULL, "E"))

mlb <- c(2.5568626, -0.6440363)
draw <- c(2.3035742, -0.4593925)

pred_y_ml <- cbind(1, pred_x) %*% mlb
pred_y_draw <- cbind(1, pred_x) %*% draw

o_ml <- order(pred_y_ml)
o_draw <- order(pred_y_draw)

lines(pred_x[o_ml, ], pred_y_ml[o_ml], col = mdc(1), lwd=2)
lines(pred_x[o_draw, ], pred_y_draw[o_draw], col = mdc(2), lwd=2)

targetid <- which(w)[order(ethanol_na$E[w])[2]]
target <- ethanol_na[targetid, ,drop=F] |> as.matrix()

points(target[,2][[1]], ethanol[targetid,1], col="orange", pch = 17, cex=1.5)
arrows(target[,2][[1]]+0.05, ethanol[targetid, 1] - 0.3,
       target[,2][[1]], ethanol[targetid, 1], length = 0.1)

target_y <- cbind(1, target[,-1,drop=F]) %*% draw
donors <- cbind(1, ethanol[!w, -1]) |> as.matrix() %*% mlb

matches <- sapply(1:100, \(i) matchindex(donors, target_y)) |> unique()
points(ethanol[!w,][matches,c("E", "NOx")],
       col = "salmon", pch=18, cex=1.5)
lines(matrix(c(cda[targetid,c("E", "NOx")],
               ethanol[!w,][matches[5], c("E", "NOx")]),
             nrow=2, byrow=T))

# Calculate target y_pred
ypred <- t(c(1, ethanol[targetid,2])) %*% draw
x_mlb <- (ypred - mlb[1]) / mlb[2]
y_mlb <- t(c(1, x_mlb)) %*% mlb

arrows(ethanol[targetid, 2], ypred,
       x_mlb, y_mlb)

legend(
  "topright", 
  c("Observed", "Imputed", "Missing", "Target", "Donors"),
  col = c(mdc(1:3), "orange", "salmon"),
  pch = c(16, 16, 16, 17, 18)
)

Notably, by using the maximum likelihood predictions (the blue line) and matching these with predictions based on the posterior draw for the missing cases, we basically search in a different subregion. Namely, we now search around the place where the larger black arrow touches the blue line.

^{Created on 2024-07-29 with reprex v2.0.2}

I'm not sure whether this is behavior we should try to evade, @stefvanbuuren.

As per your handbook section on pmm(), introducing this type-1 matching is necessary to appropriately account for uncertainty on the true value (I did not dive into the details about this, I just assumed this is correct).
pmm() is more or less a linear method with some nifty adaptations to make it more robust against model misspecification, but is still sensitive to this linearity assumption, although the assumption is maybe more on approximate linearity or monotonicity.

When imposing a model that does not fit the data very well, it is perhaps even desirably that we introduce this larger uncertainty, because the model has barely learned anything from the data, and for all it knows, the data just presents random noise around some intercept.

Do we then need to (a) adjust pmm() to make it more flexible (i.e., let it adapt to different functional forms), or (b) use another imputation method that is fully non-parametric (and thus allows for any functional form), although these have downsides as well (potential overfitting, perhaps slower)?

thomvolker/mice-pmm-puzzle.md

thomvolker commented Jul 30, 2024

Uh oh!

stefvanbuuren commented Aug 7, 2024

Uh oh!