---
title: Prediction when testing and training are stratified by a collider
---
## Background

Menni et al. (2020) look for risk factors that associate with testing positive for COVID-19. They then build a model to predict test status in untested individuals using those risk factors.

The risk factors, and testing positive, both influence whether an individual gets tested. The associations estimated in the tested sample are therefore liable to collider bias, and are not transportable to the untested sample.

How does this affect the prediction accuracy of the model?
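To see why this can matter, here is a minimal added sketch (not part of the original analysis; variable names are illustrative) with a single risk factor: the risk factor-outcome association estimated only in the tested subsample differs from the association in the full population.

```{r}
# Added sketch: one risk factor xx causes a binary outcome yy, and both xx and
# yy influence whether someone gets tested (ss). Conditioning on ss (a collider)
# distorts the xx-yy association.
nn <- 100000
xx <- rnorm(nn)
yy <- rbinom(nn, 1, plogis(xx))          # true log-odds effect of xx is 1
ss <- rbinom(nn, 1, plogis(xx + yy))     # testing depends on xx and yy
coef(glm(yy ~ xx, family="binomial"))["xx"]                 # whole population
coef(glm(yy ~ xx, family="binomial", subset=ss==1))["xx"]   # tested subsample only
```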
## Simulation
```{r}
library(dplyr)
library(pROC)
library(ggplot2)
library(tidyr)
```
This is the basic model:

- Create some risk factors (x) that influence a binary outcome (y)
- Whether individuals are in the training (i.e. has-covid-test) or testing (i.e. no-covid-test) sample is influenced by x and y
- Fit the prediction model in the has-covid-test sample
- Predict in the no-covid-test sample
- Estimate prediction accuracy using the AUC

To compare against a no-collider scenario:

- Do the same as above, except randomly split the sample into training and testing subsets, so there is no collider bias
```{r}
fn <- function(n, p, vbeta)
{
    # generate p normally distributed risk factors for n samples
    x <- matrix(rnorm(n*p), n, p) %>% as.data.frame()
    # sample some causal effects of the risk factors on the outcome
    eff <- rnorm(p, sd=vbeta)
    # generate a binary outcome from the risk factors
    y <- rbinom(n, 1, plogis(as.matrix(x) %*% eff))
    # the risk factors and the outcome both influence sample selection (the collider)
    sel <- rbinom(n, 1, plogis(as.matrix(x) %*% eff + y))
    # model 1: train in the selected (tested) samples, predict in the unselected (untested) samples
    mod1 <- glm(y ~ ., data=x, subset=sel==1, family="binomial")
    pred1 <- predict(mod1, x[sel==0,], type="response")
    proc1 <- roc(y[sel==0], pred1, quiet=TRUE)
    # model 2: divide into training and testing subsets at random, with the same expected split
    random_index <- rbinom(n, 1, sum(sel)/length(sel))
    mod2 <- glm(y ~ ., data=x, subset=random_index==1, family="binomial")
    pred2 <- predict(mod2, x[random_index==0,], type="response")
    proc2 <- roc(y[random_index==0], pred2, quiet=TRUE)
    # return AUC, mean predicted probability and proportion declared infected for each scheme
    return(tibble(
        auc=c(proc1$auc, proc2$auc),
        prev=c(mean(pred1), mean(pred2)),
        pred=c(sum(pred1>0.5)/length(pred1), sum(pred2>0.5)/length(pred2)),
        stratification=c("collider", "random")
    ))
}
```
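A single call shows the shape of the output (this quick check is an addition, not in the original gist): one row for the collider-stratified split and one for the random split.

```{r}
# Added check: one run of the simulation function with a smaller sample size
suppressMessages(fn(10000, 5, 1))
```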
Run the simulations:
```{r}
set.seed(1234)
n <- 100000
p <- 5
vbeta <- 1
nsim <- 20
res <- list()
for(i in 1:nsim)
{
    res[[i]] <- suppressMessages(fn(n, p, vbeta))
}
res <- bind_rows(res)
```
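Before plotting, a quick numerical summary of the results (an added convenience, using the already-loaded dplyr):

```{r}
# Added summary: average AUC, mean predicted probability and proportion
# declared infected, by stratification scheme
res %>%
    group_by(stratification) %>%
    summarise(mean_auc=mean(auc), mean_prev=mean(prev), mean_pred=mean(pred))
```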
Compare stratification by the collider against random stratification.

AUC:
```{r}
ggplot(res, aes(y=auc, x=stratification)) +
    geom_boxplot(aes(fill=stratification))
```
Also look at the estimated prevalence (the mean predicted probability) and the proportion declared infected (predicted probability above 0.5) in the prediction sample. In these simulations the true prevalence in the population is 50%.

Prevalence:
```{r}
ggplot(res, aes(y=prev, x=stratification)) +
    geom_boxplot(aes(fill=stratification))
```
Proportion declared infected:
```{r}
ggplot(res, aes(y=pred, x=stratification)) +
    geom_boxplot(aes(fill=stratification))
```
## Summary

Prediction accuracy drops when the sample is stratified into training and testing subsets by a collider. That is essentially what happens in Menni et al. 2020: having been tested is influenced by both the risk factors and the outcome (COVID-19 infection), the model is trained only in those selected to be tested, and it is applied only to those who are untested. The weights estimated in the tested sample are distorted in a different way from the relationships that hold in the untested sample.

Because the prediction accuracy can only be evaluated in people subject to the same sample selection, the model will appear to perform better than it truly will when applied to those who have not been tested.

Estimates of the number of infected individuals, and of prevalence, in the untested sample will also be skewed when based on these predictions.
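One way to probe the point about apparent versus real performance within the same simulation framework (an added sketch, not part of the original gist; the variable names are illustrative) is to hold out part of the tested sample as a validation set and compare the validation AUC with the AUC obtained in the untested sample.

```{r}
# Added sketch: train on half of the tested sample, then compare
# (a) AUC in the held-out tested half (what a study could report) with
# (b) AUC in the untested sample (what matters when the model is deployed).
set.seed(5678)
x <- matrix(rnorm(n*p), n, p) %>% as.data.frame()
eff <- rnorm(p, sd=vbeta)
y <- rbinom(n, 1, plogis(as.matrix(x) %*% eff))
sel <- rbinom(n, 1, plogis(as.matrix(x) %*% eff + y))
tested <- which(sel == 1)
train <- sample(tested, floor(length(tested)/2))
valid <- setdiff(tested, train)
mod <- glm(y ~ ., data=x, subset=seq_len(n) %in% train, family="binomial")
roc(y[valid], predict(mod, x[valid,], type="response"), quiet=TRUE)$auc
roc(y[sel==0], predict(mod, x[sel==0,], type="response"), quiet=TRUE)$auc
```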