---
title: Prediction when testing and training are stratified by a collider
---
## Background
Menni et al. (2020) look for risk factors that associate with testing positive for COVID-19, then build a model to predict test status in untested individuals using those risk factors.
Both the risk factors and testing positive influence whether individuals are tested at all. The associations in the tested sample are therefore likely to be biased by conditioning on a collider, and are not transportable to the untested sample.
How does this impact the prediction accuracy of the test?
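The assumed causal structure can be drawn explicitly. A minimal sketch using the dagitty package (an added dependency for illustration only; the original analysis does not use it): risk factors X affect infection Y, and both affect whether someone gets tested.
```{r}
# Illustrative DAG: X -> Y, and both X and Y -> Tested (the collider)
library(dagitty)
g <- dagitty("dag {
	X -> Y
	X -> Tested
	Y -> Tested
}")
plot(graphLayout(g))
```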
## Simulation
```{r}
library(dplyr)
library(pROC)
library(ggplot2)
library(tidyr)
```
This is the basic model:

- Create some risk factors (x) that influence a binary outcome (y)
- Whether individuals are in the training (i.e. has-covid-test) or testing (i.e. no-covid-test) sample is influenced by x and y
- Fit a prediction model in the has-covid-test sample
- Predict in the no-covid-test sample
- Estimate prediction accuracy using AUC

To compare against a no-collider scenario:

- Do the same as above, except the sample is randomly split into training and testing subsets, so there is no collider bias
```{r}
fn <- function(n, p, vbeta)
{
	# Generate p normally distributed risk factors for n samples
	x <- matrix(rnorm(n * p), n, p) %>% as.data.frame()

	# Sample causal effects of the risk factors on the outcome
	# (note: vbeta is used as the standard deviation of the effects)
	eff <- rnorm(p, sd = vbeta)

	# Generate a binary outcome from the risk factors
	y <- rbinom(n, 1, plogis(as.matrix(x) %*% eff))

	# The risk factors and the outcome both influence sample selection
	sel <- rbinom(n, 1, plogis(as.matrix(x) %*% eff + y))

	# Model 1: train in the selected (tested) samples and
	# predict in the unselected (untested) samples
	mod1 <- glm(y ~ ., data = x, subset = sel == 1, family = "binomial")
	pred1 <- predict(mod1, x[sel == 0, ], type = "response")
	proc1 <- roc(y[sel == 0], pred1, quiet = TRUE)

	# Model 2: randomly divide into training and testing subsets,
	# keeping the same expected training fraction as in model 1
	random_index <- rbinom(n, 1, sum(sel) / length(sel))
	mod2 <- glm(y ~ ., data = x, subset = random_index == 1, family = "binomial")
	pred2 <- predict(mod2, x[random_index == 0, ], type = "response")
	proc2 <- roc(y[random_index == 0], pred2, quiet = TRUE)

	# Return the AUC, the mean predicted probability (prevalence estimate),
	# and the proportion predicted infected at a 0.5 threshold
	return(tibble(
		auc = c(proc1$auc, proc2$auc),
		prev = c(mean(pred1), mean(pred2)),
		pred = c(sum(pred1 > 0.5) / length(pred1), sum(pred2 > 0.5) / length(pred2)),
		stratification = c("collider", "random")
	))
}
```
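As a quick check, a single run of the function returns one row per stratification scheme (an illustrative call, not part of the original analysis):
```{r}
set.seed(42)
fn(n = 10000, p = 5, vbeta = 1)
```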
Run simulations
```{r}
set.seed(1234)
n <- 100000
p <- 5
vbeta <- 1
nsim <- 20
res <- list()
for(i in 1:nsim)
{
	res[[i]] <- suppressMessages(fn(n, p, vbeta))
}
res <- bind_rows(res)
```
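A numerical summary of the replicates can complement the plots below (an added summary step, not in the original):
```{r}
# Mean of each metric across the replicates, by stratification scheme
res %>%
	group_by(stratification) %>%
	summarise(mean_auc = mean(auc), mean_prev = mean(prev), mean_pred = mean(pred))
```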
Compare stratification on a collider with random stratification.
AUC:
```{r}
ggplot(res, aes(y=auc, x=stratification)) +
	geom_boxplot(aes(fill=stratification))
```
Also look at the estimated prevalence and the proportion declared infected. In these simulations the true population prevalence is 50% in expectation.
Prevalence:
```{r}
ggplot(res, aes(y=prev, x=stratification)) +
	geom_boxplot(aes(fill=stratification))
```
Proportion declared infected:
```{r}
ggplot(res, aes(y=pred, x=stratification)) +
	geom_boxplot(aes(fill=stratification))
```
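tidyr is loaded above but not used in the individual plots; the three metrics can also be shown together in one faceted figure (a sketch):
```{r}
# Reshape the metrics to long format and facet one boxplot per metric
res %>%
	pivot_longer(c(auc, prev, pred), names_to = "metric", values_to = "value") %>%
	ggplot(aes(x = stratification, y = value, fill = stratification)) +
	geom_boxplot() +
	facet_wrap(~ metric, scales = "free_y")
```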
## Summary
Prediction accuracy drops when the sample is stratified into training and testing subsets by a collider. That is essentially what happens in Menni et al. (2020): having been tested is influenced by both the risk factors and the outcome (COVID-19 infection). The model is trained only in those selected for testing and applied only to those who are untested, and the model weights estimated in the tested sample are distorted relative to the true relationships among the untested.
Because prediction accuracy can only be evaluated in those subject to the same sample selection, the model will appear to perform better than it truly will when applied to those who haven't had a test.
Numbers of infected in the untested sample, and prevalence estimates based on these predictions, will also be skewed.
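The distortion of the model weights can be seen directly by comparing fitted coefficients against the true effects. A standalone sketch, re-creating the data-generating steps from `fn` above (not part of the original analysis):
```{r}
set.seed(1234)
n <- 100000; p <- 5; vbeta <- 1
x <- matrix(rnorm(n * p), n, p) %>% as.data.frame()
eff <- rnorm(p, sd = vbeta)
y <- rbinom(n, 1, plogis(as.matrix(x) %*% eff))
sel <- rbinom(n, 1, plogis(as.matrix(x) %*% eff + y))

# Coefficients estimated in the full sample vs the selected (tested) sample
coef_all <- coef(glm(y ~ ., data = x, family = "binomial"))[-1]
coef_sel <- coef(glm(y ~ ., data = x, subset = sel == 1, family = "binomial"))[-1]
round(cbind(true = eff, full_sample = coef_all, tested_only = coef_sel), 3)
```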