Last active
May 4, 2016 09:54
-
-
Save explodecomputer/9dde3550f5c567106844 to your computer and use it in GitHub Desktop.
smoking and age simulation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
@article{Doll2004, | |
author = {Doll, R.}, | |
doi = {10.1136/bmj.38142.554479.AE}, | |
file = {:Users/explodecomputer/papers/1519.full.pdf:pdf}, | |
issn = {0959-8138}, | |
journal = {Bmj}, | |
number = {7455}, | |
pages = {1519--0}, | |
title = {{Mortality in relation to smoking: 50 years' observations on male British doctors}}, | |
url = {http://www.bmj.com/cgi/doi/10.1136/bmj.38142.554479.AE}, | |
volume = {328}, | |
year = {2004}, | |
} | |
@article{Shiffman2008, | |
abstract = {Assessment in clinical psychology typically relies on global retrospective self-reports collected at research or clinic visits, which are limited by recall bias and are not well suited to address how behavior changes over time and across contexts. Ecological momentary assessment (EMA) involves repeated sampling of subjects' current behaviors and experiences in real time, in subjects' natural environments. EMA aims to minimize recall bias, maximize ecological validity, and allow study of microprocesses that influence behavior in real-world contexts. EMA studies assess particular events in subjects' lives or assess subjects at periodic intervals, often by random time sampling, using technologies ranging from written diaries and telephones to electronic diaries and physiological sensors. We discuss the rationale for EMA, EMA designs, methodological and practical issues, and comparisons of EMA and recall data. EMA holds unique promise to advance the science and practice of clinical psychology by shedding light on the dynamics of behavior in real-world settings.}, | |
author = {Shiffman, S and Stone, A A and Hufford, M R}, | |
doi = {10.1146/annurev.clinpsy.3.022806.091415}, | |
file = {:Users/explodecomputer/papers/nihms121046.pdf:pdf}, | |
isbn = {1548-5943 (Print)$\backslash$r1548-5943 (Linking)}, | |
issn = {1548-5943; 1548-5943}, | |
journal = {Annual review of clinical psychology}, | |
keywords = {*Psychology, Clinical/instrumentation/methods,*Social Environment,Data Collection/*instrumentation,Humans,Medical History Taking,Social Behavior,Time Factors}, | |
number = {5}, | |
pages = {1--32}, | |
pmid = {18509902}, | |
title = {{Ecological momentary assessment}}, | |
url = {http://www.ncbi.nlm.nih.gov/pubmed/18509902}, | |
volume = {4}, | |
year = {2008}, | |
} | |
@article{Thorgeirsson2010, | |
author = {Thorgeirsson, Thorgeir E and Gudbjartsson, Daniel F and Surakka, Ida and Vink, Jacqueline M and Amin, Najaf and Geller, Frank and Sulem, Patrick and Rafnar, Thorunn and Esko, T{\~{o}}nu and Walter, Stefan and Gieger, Christian and Rawal, Rajesh and Mangino, Massimo and Prokopenko, Inga and M{\"{a}}gi, Reedik and Keskitalo, Kaisu and Gudjonsdottir, Iris H and Gretarsdottir, Solveig and Stefansson, Hreinn and Thompson, John R and Aulchenko, Yurii S and Nelis, Mari and Aben, Katja K and den Heijer, Martin and Dirksen, Asger and Ashraf, Haseem and Soranzo, Nicole and Valdes, Ana M and Steves, Claire and Uitterlinden, Andr{\'{e}} G and Hofman, Albert and T{\"{o}}njes, Anke and Kovacs, Peter and Hottenga, Jouke Jan and Willemsen, Gonneke and Vogelzangs, Nicole and D{\"{o}}ring, Angela and Dahmen, Norbert and Nitz, Barbara and Pergadia, Michele L and Saez, Berta and {De Diego}, Veronica and Lezcano, Victoria and Garcia-Prats, Maria D and Ripatti, Samuli and Perola, Markus and Kettunen, Johannes and Hartikainen, Anna-Liisa and Pouta, Anneli and Laitinen, Jaana and Isohanni, Matti and Huei-Yi, Shen and Allen, Maxine and Krestyaninova, Maria and Hall, Alistair S and Jones, Gregory T and van Rij, Andre M and Mueller, Thomas and Dieplinger, Benjamin and Haltmayer, Meinhard and Jonsson, Steinn and Matthiasson, Stefan E and Oskarsson, Hogni and Tyrfingsson, Thorarinn and Kiemeney, Lambertus A and Mayordomo, Jose I and Lindholt, Jes S and Pedersen, Jesper Holst and Franklin, Wilbur A and Wolf, Holly and Montgomery, Grant W and Heath, Andrew C and Martin, Nicholas G and Madden, Pamela A F and Giegling, Ina and Rujescu, Dan and J{\"{a}}rvelin, Marjo-Riitta and Salomaa, Veikko and Stumvoll, Michael and Spector, Tim D and Wichmann, H-Erich and Metspalu, Andres and Samani, Nilesh J and Penninx, Brenda W and Oostra, Ben A and Boomsma, Dorret I and Tiemeier, Henning and van Duijn, Cornelia M and Kaprio, Jaakko and Gulcher, Jeffrey R and McCarthy, Mark I and Peltonen, Leena and Thorsteinsdottir, Unnur and Stefansson, Kari}, | |
doi = {10.1038/ng.573}, | |
file = {:Users/explodecomputer/papers/ng.573.pdf:pdf}, | |
issn = {1061-4036}, | |
journal = {Nature Genetics}, | |
number = {5}, | |
pages = {448--453}, | |
title = {{Sequence variants at CHRNB3–CHRNA6 and CYP2A6 affect smoking behavior}}, | |
url = {http://www.nature.com/doifinder/10.1038/ng.573}, | |
volume = {42}, | |
year = {2010}, | |
} | |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: A frailty effect is sufficient to explain the apparent association of a smoking heaviness variant on smoking initiation in a cross sectional population | |
date: "`r Sys.Date()`" | |
output: pdf_document | |
bibliography: smoking_age.bib | |
--- | |
```{r, echo=FALSE} | |
opts_chunk$set(warning=FALSE, message=FALSE, echo=FALSE) | |
``` | |
```{r, echo=FALSE} | |
set.seed(1234) | |
suppressPackageStartupMessages(library(foreign)) | |
suppressPackageStartupMessages(library(dplyr)) | |
suppressPackageStartupMessages(library(ggplot2)) | |
suppressPackageStartupMessages(library(pander)) | |
suppressPackageStartupMessages(library(knitr)) | |
suppressPackageStartupMessages(library(reshape2)) | |
suppressPackageStartupMessages(library(sn)) | |
``` | |
## Overview of the simulation | |
The CHRNA5 variant rs1051730 is thought to associate with smoking quantity but not with smoking initiation. **Perform a simulation to see if a frailty effect is sufficient to generate the apparent association of the CHRNA5 variant with smoking initiation.** | |
- We will simulate a cross sectional population that has a range of ages from 20 to 100. - Each individual will be given a status of `alive/dead`. | |
- This status is determined by age and one explanatory variable: number of cigarettes per day (CPD). | |
- CPD is simulated to have a distribution amongst smokers similar to that described previously [@Shiffman2008], and with an effect related to the *smoking quantity* SNP in CHRNA5 [@Thorgeirsson2010]. | |
- The difference in mortality rates between `ever/never` smokers is shown to be approximately 22% at 70 years [@Doll2004]. We need to determine the corresponding effect size for CPD on mortality. This is done through simulation - finding the effect size for CPD on mortality that recapitulates an effect of 22% reduction in survival at 70 years for ever smokers vs never smokers. | |
- We can then estimate the expected effect of the *smoking quantity* variant on *smoking initiation* that is induced by frailty effect. | |
## Parameter estimates | |
Estimate the allele frequency of the CHRNA5 variant (rs1051730) in people below the age of 40 using the Copenhagen general population data. | |
```{r, echo=FALSE } | |
copenhagen <- read.dta("~/repo/copenhagen/data/scratch/bristolv3_12.dta") | |
snp <- copenhagen$CHRNA3_rs1051730 | |
snp <- as.character(copenhagen$CHRNA3_rs1051730) | |
snp[snp=="GG"] <- 0 | |
snp[snp=="GA"] <- 1 | |
snp[snp=="AA"] <- 2 | |
snp <- as.numeric(snp) | |
copenhagen$snp <- snp | |
af <- filter(copenhagen, age < 40) %>% | |
summarise(rs1051730=sum(snp, na.rm=T) / sum(!is.na(snp)) / 2) %>% | |
unlist() | |
pander(table(snp)) | |
``` | |
Allele frequency of A allele: `r round(af, 2)`. | |
The relationship of allele frequency with age looks like this: | |
```{r, echo=FALSE } | |
copenhagen$agebin <- cut(copenhagen$age, breaks=20) | |
af_by_age <- copenhagen %>% | |
group_by(agebin) %>% | |
summarise(af = sum(snp, na.rm=T) / (sum(!is.na(snp))*2), n=sum(!is.na(snp)), se = sd(snp, na.rm=T) / sqrt(n)) | |
ggplot(af_by_age, aes(x=agebin, y=af)) + | |
geom_errorbar(aes(ymin=af - se*1.96, ymax=af+se*1.96), width=0) + | |
geom_point() + | |
theme(axis.text.x=element_text(angle=90)) + | |
labs(x="Age", y="Allele frequency") | |
``` | |
The proportion of ever smokers in individuals < 40 in the Copenhagen data is: | |
```{r } | |
copenhagen$ever <- as.character(copenhagen$status_smoking) | |
copenhagen$ever[copenhagen$ever == "Never smoker"] <- FALSE | |
copenhagen$ever[copenhagen$ever != "FALSE"] <- TRUE | |
copenhagen$ever <- as.logical(copenhagen$ever) | |
prop_smok <- filter(copenhagen, age < 40) %>% | |
summarise(prop_ever = sum(ever, na.rm=T) / sum(!is.na(ever))) %>% | |
unlist() | |
pander(prop_smok) | |
``` | |
The relationship of ever/never ratio with age looks like this: | |
```{r, echo=FALSE } | |
ever_by_age <- copenhagen %>% | |
group_by(agebin) %>% | |
summarise(prop_ever = sum(ever, na.rm=T) / (sum(!is.na(ever))), n=sum(!is.na(ever)), se = sd(ever, na.rm=T) / sqrt(n)) | |
ggplot(ever_by_age, aes(x=agebin, y=prop_ever)) + | |
geom_errorbar(aes(ymin=prop_ever - se*1.96, ymax=prop_ever+se*1.96), width=0) + | |
geom_point() + | |
theme(axis.text.x=element_text(angle=90)) + | |
labs(x="Age", y="Proportion of ever smokers") | |
``` | |
## Simulate cross sectional population | |
### Population characteristics | |
Set a sample size and age ranges | |
```{r, echo=TRUE} | |
n <- 110000 | |
age <- rnorm(n, mean(copenhagen$age, na.rm=T), sd(copenhagen$age, na.rm=T)) | |
``` | |
### Smoking characteristics | |
Arbitrarily assign `ever/never` smoking satus | |
```{r } | |
ever <- rbinom(n, 1, prop_smok) | |
pander(table(ever)) | |
``` | |
Create a SNP with allele frequency of `r af` (independent of `ever/never` status) | |
```{r } | |
snp <- rbinom(n, 2, 1-af) | |
pander(table(snp)) | |
``` | |
We use this SNP to generate a *smoking quantity* variable (cigarettes per day, `cpd`), which has an effect size on CPD as reported in [@Thorgeirsson2010] amongst `ever` smokers, but all `never` smokers have `cpd` value of 0. | |
```{r } | |
snp_effect <- 3 | |
cpd <- pmax(1, rsn(n, 11, 14, 2)) + snp * snp_effect | |
cpd[ever==0] <- 0 | |
dat <- data.frame(age, snp, ever, cpd, ager=round(age)) | |
``` | |
The distribution of CPD among `ever` smokers is skewed to match the data shown in [@Shiffman2008]: | |
```{r } | |
hist(cpd[ever==1], xlab="CPD among ever smokers", main="") | |
``` | |
The mean CPD amongst ever smokers is `r round(mean(cpd[ever==1]), 2)` and the effect of the SNP on *smoking quantity* amongst smokers is `r snp_effect` in CPD units. | |
### Mortality | |
The `alive/dead` status will be assumed to follow the Gompertz-Makeham law of mortality. Its hazard function is | |
$$ | |
h(t) = \alpha \exp(\beta x_i) + \lambda | |
$$ | |
which gives survival function of | |
$$ | |
S(t) = 1 - F(t) = 1-1-\exp(-\lambda x-\frac{\alpha}{\beta}(e^{\beta x}-1)) | |
$$ | |
where $F(t)$ is the cumulative density with parameters | |
```{r } | |
gm_cdf <- function(t, alpha, beta, lambda) | |
{ | |
pmax(1 - exp( -lambda * t - alpha / beta * (exp(beta*t) - 1) ), 0) | |
} | |
alpha <- 7.478359e-05 | |
beta <- 8.604875e-02 | |
lambda <- -1.846973e-03 | |
``` | |
which gives a baseline mortality curve that looks like: | |
```{r } | |
plot(x=1:100, y=1 - gm_cdf(1:100, alpha, beta, lambda), xlab="Age", ylab="S(t)", type="l") | |
``` | |
```{r } | |
survival_diff_age <- 70 | |
target_survival_diff <- 0.22 | |
``` | |
The estimated difference in survival between ever and never cigeratte smokers at the age of `r survival_diff_age` is about `r target_survival_diff*100`% [@Doll2004]. To simulate this effect we can use a life acceleration model such that | |
$$ | |
S_{ever}(t) = S_{never}(t / \exp(\beta x)) | |
$$ | |
where $x$ takes a value of 0 for never smokers and CPD for ever smokers and $\beta$ is its multiplicative effect on survival. | |
We can simulate the `alive/dead` status of each individual in our population using the following binomial model: | |
$$ | |
P(Status = alive) = Binom(n = 1, p = 1 - F(t/\exp(\beta x_i))) | |
$$ | |
where $F(t)$ is the cumulative density function of the Gompertz-Makeham. We need to find a value of $\beta$ that gives a survival difference of `r target_survival_diff` at age `r survival_diff_age`. | |
```{r } | |
get_estimate <- function(dat, candidate, alpha, beta, lambda, nsim) | |
{ | |
dat$alive <- 0 | |
for(i in 1:nsim) | |
{ | |
alive <- rbinom( | |
nrow(dat), | |
1, | |
1 - gm_cdf(dat$age / exp(dat$cpd * candidate), | |
alpha, beta, lambda) | |
) | |
dat$alive <- dat$alive + alive | |
} | |
dat$alive <- dat$alive / nsim | |
s <- melt(tapply(dat$alive, list(dat$ager, dat$ever), mean)) | |
s$Var2 <- as.factor(s$Var2) | |
levels(s$Var2) <- c("Baseline", "Ever smoker") | |
s70 <- subset(s, Var1==70) | |
p <- ggplot(s, aes(x=Var1, y=value)) + | |
geom_line(aes(colour=factor(Var2))) + | |
labs(x="Age", y="S(t)", colour="Curve") | |
return(list(d=s70$value[1] - s70$value[2], p=p)) | |
} | |
candidates <- seq(-0.01, -0.001, by=0.0005) | |
l <- list() | |
for(i in 1:length(candidates)) | |
{ | |
l[[i]] <- get_estimate(dat, candidates[i], alpha, beta, lambda, 10) | |
} | |
m <- which.min(abs(target_survival_diff - sapply(l, function(x) x$d))) | |
cpd_effect <- candidates[m] | |
dat$alive <- rbinom( | |
nrow(dat), | |
1, | |
1 - gm_cdf(dat$age / exp(dat$cpd * cpd_effect), | |
alpha, beta, lambda) | |
) | |
``` | |
We can do this using simulation which gives an effect of CPD on mortality of `r cpd_effect`. The survival curves for baseline and ever smokers look like this (note, shakiness of curves due to stochasticity from simulations): | |
```{r } | |
print(l[[m]]$p) | |
``` | |
## Frailty effect estimation | |
Perform GLM of ever/never on SNP (e.g. effects are on liability scale, not OR) for age bins of `>40`, `>45`, etc (`min_age` categories); or `<40`, `<45` etc (`max_age` categories). We do this both empirically with the Copenhagen data, and using the simulated data. | |
```{r } | |
age_bins <- seq(40, 70, by=5) | |
res1 <- lapply(age_bins, function(x) | |
{ | |
d <- dat[dat$age <= x & dat$alive==1, ] | |
mod <- summary(glm(ever ~ snp, data=d, family="binomial")) | |
a <- data.frame( | |
age = x, | |
Estimate = coefficients(mod)[2,1], | |
SE = coefficients(mod)[2,2], | |
n = sum(dat$age <= x), | |
desc = "Max age") | |
return(a) | |
}) | |
res2 <- lapply(age_bins, function(x) | |
{ | |
d <- dat[dat$age >= x & dat$alive==1, ] | |
mod <- summary(glm(ever ~ snp, data=d, family="binomial")) | |
a <- data.frame( | |
age = x, | |
Estimate = coefficients(mod)[2,1], | |
SE = coefficients(mod)[2,2], | |
n = sum(dat$age > x), | |
desc = "Min age") | |
return(a) | |
}) | |
res1 <- bind_rows(res1) | |
res2 <- bind_rows(res2) | |
res <- rbind(res1, res2) | |
res$source <- "Simulation" | |
res1 <- lapply(age_bins, function(x) | |
{ | |
d <- copenhagen[copenhagen$age <= x, ] | |
mod <- summary(glm(ever ~ snp, data=d, family="binomial")) | |
a <- data.frame( | |
age = x, | |
Estimate = coefficients(mod)[2,1], | |
SE = coefficients(mod)[2,2], | |
n = sum(copenhagen$age <= x), | |
desc = "Max age") | |
return(a) | |
}) | |
res2 <- lapply(age_bins, function(x) | |
{ | |
d <- copenhagen[copenhagen$age >= x, ] | |
mod <- summary(glm(ever ~ snp, data=d, family="binomial")) | |
a <- data.frame( | |
age = x, | |
Estimate = coefficients(mod)[2,1], | |
SE = coefficients(mod)[2,2], | |
n = sum(copenhagen$age > x), | |
desc = "Min age") | |
return(a) | |
}) | |
res1 <- bind_rows(res1) | |
res2 <- bind_rows(res2) | |
res_empirical <- rbind(res1, res2) | |
res_empirical$source <- "Copenhagan" | |
res <- rbind(res, res_empirical) | |
write.csv(res, file="smoking_age.csv") | |
ggplot(res, aes(y=age, x=Estimate)) + | |
geom_point() + | |
geom_errorbarh(aes(xmin=Estimate-SE*1.96, xmax=Estimate+SE*1.96), height=0) + | |
scale_y_reverse() + | |
facet_grid(desc ~ source) + | |
labs(x="Estimate of rs1051730 on smoking initiation") + | |
geom_vline(xintercept=0) | |
``` | |
The empirical results in the Copenhagen data shows a clear trend of the SNP having a protective effect on smoking initiation as age increases, consistent with a frailty effect. This pattern is reproduced relatively well in the simulated data which assumes the following model: | |
1. The SNP has an effect on CPD among ever smokers | |
2. Survival is a Gompertz-Makeham accelerated age model with the only explanatory variable being CPD | |
3. The effect of CPD on accelerated age is simulated such that survival reduces by `r target_survival_diff * 100`% at `r survival_diff_age` years amongst ever smokers. | |
--- | |
Source code available at: [https://gist.github.com/explodecomputer/9dde3550f5c567106844](https://gist.github.com/explodecomputer/9dde3550f5c567106844) | |
This document was rendered using | |
```{r eval=FALSE, echo=TRUE} | |
library(rmarkdown) | |
render("smoking_age.Rmd") | |
``` | |
## References |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment