---
title: Prediction when testing and training are stratified by a collider
---
## Background

Menni et al. (2020) look for risk factors that associate with testing positive for COVID-19. They then build a model to predict test status in untested individuals using those risk factors.

The risk factors, and testing positive, both influence whether an individual gets tested. The associations estimated in the tested sample are therefore liable to collider bias, and are not transportable to the untested sample.

How does this affect the prediction accuracy of the model?
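To see why this can matter, here is a minimal added sketch (not part of the original analysis; variable names are illustrative) with a single risk factor: the risk factor-outcome association estimated only in the tested subsample differs from the association in the full population.

```{r}
# Added sketch: one risk factor xx causes a binary outcome yy, and both xx and
# yy influence whether someone gets tested (ss). Conditioning on ss (a collider)
# distorts the xx-yy association.
nn <- 100000
xx <- rnorm(nn)
yy <- rbinom(nn, 1, plogis(xx))          # true log-odds effect of xx is 1
ss <- rbinom(nn, 1, plogis(xx + yy))     # testing depends on xx and yy
coef(glm(yy ~ xx, family="binomial"))["xx"]                 # whole population
coef(glm(yy ~ xx, family="binomial", subset=ss==1))["xx"]   # tested subsample only
```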
## Simulation
```{r}
library(dplyr)
library(pROC)
library(ggplot2)
library(tidyr)
```
This is the basic model:

- Create some risk factors (x) that influence a binary outcome (y)
- Whether individuals are in the training (i.e. has-covid-test) or testing (i.e. no-covid-test) sample is influenced by x and y
- Fit the prediction model in the has-covid-test sample
- Predict in the no-covid-test sample
- Estimate prediction accuracy using the AUC

To compare against a no-collider scenario:

- Do the same as above, except randomly split the sample into training and testing subsets, so there is no collider bias
```{r}
fn <- function(n, p, vbeta)
{
    # generate p normally distributed risk factors for n samples
    x <- matrix(rnorm(n*p), n, p) %>% as.data.frame()
    # sample some causal effects of the risk factors on the outcome
    eff <- rnorm(p, sd=vbeta)
    # generate a binary outcome from the risk factors
    y <- rbinom(n, 1, plogis(as.matrix(x) %*% eff))
    # the risk factors and the outcome both influence sample selection (the collider)
    sel <- rbinom(n, 1, plogis(as.matrix(x) %*% eff + y))
    # model 1: train in the selected (tested) samples, predict in the unselected (untested) samples
    mod1 <- glm(y ~ ., data=x, subset=sel==1, family="binomial")
    pred1 <- predict(mod1, x[sel==0,], type="response")
    proc1 <- roc(y[sel==0], pred1, quiet=TRUE)
    # model 2: divide into training and testing subsets at random, with the same expected split
    random_index <- rbinom(n, 1, sum(sel)/length(sel))
    mod2 <- glm(y ~ ., data=x, subset=random_index==1, family="binomial")
    pred2 <- predict(mod2, x[random_index==0,], type="response")
    proc2 <- roc(y[random_index==0], pred2, quiet=TRUE)
    # return AUC, mean predicted probability and proportion declared infected for each scheme
    return(tibble(
        auc=c(proc1$auc, proc2$auc),
        prev=c(mean(pred1), mean(pred2)),
        pred=c(sum(pred1>0.5)/length(pred1), sum(pred2>0.5)/length(pred2)),
        stratification=c("collider", "random")
    ))
}
```
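A single call shows the shape of the output (this quick check is an addition, not in the original gist): one row for the collider-stratified split and one for the random split.

```{r}
# Added check: one run of the simulation function with a smaller sample size
suppressMessages(fn(10000, 5, 1))
```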
Run the simulations:
```{r}
set.seed(1234)
n <- 100000
p <- 5
vbeta <- 1
nsim <- 20
res <- list()
for(i in 1:nsim)
{
    res[[i]] <- suppressMessages(fn(n, p, vbeta))
}
res <- bind_rows(res)
```
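Before plotting, a quick numerical summary of the results (an added convenience, using the already-loaded dplyr):

```{r}
# Added summary: average AUC, mean predicted probability and proportion
# declared infected, by stratification scheme
res %>%
    group_by(stratification) %>%
    summarise(mean_auc=mean(auc), mean_prev=mean(prev), mean_pred=mean(pred))
```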
Compare stratification by the collider against random stratification.

AUC:
```{r}
ggplot(res, aes(y=auc, x=stratification)) +
    geom_boxplot(aes(fill=stratification))
```
Also look at the estimated prevalence (the mean predicted probability) and the proportion declared infected (predicted probability above 0.5) in the prediction sample. In these simulations the true prevalence in the population is 50%.

Prevalence:
```{r}
ggplot(res, aes(y=prev, x=stratification)) +
    geom_boxplot(aes(fill=stratification))
```
Proportion declared infected:
```{r}
ggplot(res, aes(y=pred, x=stratification)) +
    geom_boxplot(aes(fill=stratification))
```
## Summary

Prediction accuracy drops when the sample is stratified into training and testing subsets by a collider. That is essentially what happens in Menni et al. 2020: having been tested is influenced by both the risk factors and the outcome (COVID-19 infection), the model is trained only in those selected to be tested, and it is applied only to those who are untested. The weights estimated in the tested sample are distorted in a different way from the relationships that hold in the untested sample.

Because the prediction accuracy can only be evaluated in people subject to the same sample selection, the model will appear to perform better than it truly will when applied to those who have not been tested.

Estimates of the number of infected individuals, and of prevalence, in the untested sample will also be skewed when based on these predictions.
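One way to probe the point about apparent versus real performance within the same simulation framework (an added sketch, not part of the original gist; the variable names are illustrative) is to hold out part of the tested sample as a validation set and compare the validation AUC with the AUC obtained in the untested sample.

```{r}
# Added sketch: train on half of the tested sample, then compare
# (a) AUC in the held-out tested half (what a study could report) with
# (b) AUC in the untested sample (what matters when the model is deployed).
set.seed(5678)
x <- matrix(rnorm(n*p), n, p) %>% as.data.frame()
eff <- rnorm(p, sd=vbeta)
y <- rbinom(n, 1, plogis(as.matrix(x) %*% eff))
sel <- rbinom(n, 1, plogis(as.matrix(x) %*% eff + y))
tested <- which(sel == 1)
train <- sample(tested, floor(length(tested)/2))
valid <- setdiff(tested, train)
mod <- glm(y ~ ., data=x, subset=seq_len(n) %in% train, family="binomial")
roc(y[valid], predict(mod, x[valid,], type="response"), quiet=TRUE)$auc
roc(y[sel==0], predict(mod, x[sel==0,], type="response"), quiet=TRUE)$auc
```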