explodecomputer · December 31, 2015 18:45
diff --git a/survival.Rmd b/survival.Rmd
 ---
 title: Notes on survival analysis
 author: Gibran Hemani
 date: "`r Sys.Date()`"
 output: pdf_document
 bibliography: survival.bib
 ---

 These notes are based on [http://data.princeton.edu/wws509/notes](http://data.princeton.edu/wws509/notes).

 ## The survival function

 $T$ is a continuous random variable with pdf $f(t)$ and cdf $F(t) = P(T < t)$, giving the probability that the event has occurred by duration $t$. An example is the Gompertz-Makeham law, which has pdf

 $$
 f(x) = (\alpha e^{\beta x} + \lambda)\cdot \exp\left(-\lambda x-\frac{\alpha}{\beta}(e^{\beta x}-1)\right)
 $$

 and CDF

 $$
 F(x) = 1-\exp\left(-\lambda x-\frac{\alpha}{\beta}(e^{\beta x}-1)\right)
 $$

 These distributions look like:

 ```{r }

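 # Gompertz-Makeham density: hazard (a * exp(b*x) + lambda) times survival
 gm_pdf <- function(x, a, b, lambda)
 {
 	(a * exp(b*x) + lambda) * exp(-lambda * x - a / b * (exp(b*x) - 1))
 }

 # Gompertz-Makeham cumulative distribution function
 gm_cdf <- function(x, a, b, lambda)
 {
 	1 - exp( -lambda * x - a / b * (exp(b*x) - 1) )
 }

 age <- 1:100
 a <- 7.478359e-05
 b <- 8.604875e-02
 # Note: with a negative lambda, a * exp(b*x) + lambda is negative for ages up to 37,
 # so the curves are only a valid density and hazard above that age
 lambda <- -1.846973e-03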
 par(mfrow=c(2,1))
 plot(
 	x=age, y=gm_pdf(age, a, b, lambda), 
 	main="PDF of Gompertz-Makeham"
 )
 plot(
 	x=age, y=gm_cdf(age, a, b, lambda), 
 	main="CDF of Gompertz-Makeham"
 )

 ```

 The survival function gives the probability of being alive at duration $t$, i.e. that the event has not occurred by duration $t$:

 $$
 S(t) = 1 - F(t) = \int_{t}^{\infty} f(x)dx
 $$
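
 This identity can be checked numerically. The sketch below is illustrative only: the parameter values are assumed (chosen with a positive `lambda` so that the density is non-negative at every age, unlike the values used above), `gm_surv` is a helper name introduced here, and the upper limit of 150 stands in for $\infty$ because the density is numerically zero beyond it.

 ```{r }

 # Assumed, illustrative parameters (not the values used elsewhere in these notes)
 a <- 7.5e-05
 b <- 0.086
 lambda <- 1.8e-03

 gm_pdf <- function(x, a, b, lambda)
 {
 	(a * exp(b*x) + lambda) * exp(-lambda * x - a / b * (exp(b*x) - 1))
 }

 gm_cdf <- function(x, a, b, lambda)
 {
 	1 - exp(-lambda * x - a / b * (exp(b*x) - 1))
 }

 # Survival function S(t) = 1 - F(t); hypothetical helper
 gm_surv <- function(x, a, b, lambda)
 {
 	1 - gm_cdf(x, a, b, lambda)
 }

 t0 <- 60
 # S(t0) computed directly from the cdf
 s_direct <- gm_surv(t0, a, b, lambda)
 # S(t0) computed as the integral of the pdf from t0 upwards
 s_integral <- integrate(gm_pdf, lower = t0, upper = 150, a = a, b = b, lambda = lambda)$value
 c(direct = s_direct, integral = s_integral)

 # S is non-increasing in t
 all(diff(gm_surv(1:100, a, b, lambda)) <= 0)

 ```

 The two values should agree up to the tolerance of `integrate`.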

 So $S(x) \leq S(t)$ for all $x > t$. The hazard function is 

 $$
 \lambda(t) = \lim_{dt \to 0} \frac{P(t \leq T < t+dt \mid T \geq t)}{dt}
 $$

 which is the instantaneous rate of occurrence per unit time, given survival to duration $t$. For the Gompertz-Makeham law it is plotted in the next section.
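
 The limit can be approximated with a small but finite `dt`. This is a minimal sketch with assumed parameter values (not the ones used above) and a hypothetical helper `gm_surv`; it compares the finite-difference rate with the closed form $\alpha e^{\beta t} + \lambda$ implied by the Gompertz-Makeham pdf and cdf.

 ```{r }

 # Assumed, illustrative parameters with a positive lambda
 a <- 7.5e-05
 b <- 0.086
 lambda <- 1.8e-03

 # Gompertz-Makeham survival function S(t) = 1 - F(t); hypothetical helper
 gm_surv <- function(x, a, b, lambda)
 {
 	exp(-lambda * x - a / b * (exp(b*x) - 1))
 }

 t0 <- 60
 dt <- 1e-4

 # P(t0 <= T < t0 + dt | T >= t0) = (S(t0) - S(t0 + dt)) / S(t0)
 p_cond <- (gm_surv(t0, a, b, lambda) - gm_surv(t0 + dt, a, b, lambda)) / gm_surv(t0, a, b, lambda)

 # Dividing by dt gives a finite-difference approximation of the hazard
 haz_fd <- p_cond / dt
 haz_exact <- a * exp(b * t0) + lambda
 c(finite_difference = haz_fd, closed_form = haz_exact)

 ```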


 ## The hazard function

 The hazard function can also be written as

 $$
 \lambda(t) = \frac{f(t)}{S(t)}
 $$

 For the Gompertz-Makeham law the hazard is $\alpha e^{\beta t} + \lambda$:

 ```{r }

 # Gompertz-Makeham hazard: age-dependent Gompertz term plus constant Makeham term
 gm_hz <- function(x, a, b, lambda)
 {
 	a * exp(b*x) + lambda
 }

 plot(
 	x=age, y=gm_hz(age, a, b, lambda), 
 	main="Hazard function of Gompertz-Makeham"
 )

 ```

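 The hazard relates to the pdf and cdf as $\lambda(t) = f(t) / (1 - F(t))$, so plotting one against the other gives the identity line: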

 ```{r }

 # Points fall on the identity line because the hazard equals pdf / (1 - cdf)
 plot(
 	x = gm_hz(age, a, b, lambda), 
 	y = gm_pdf(age, a, b, lambda) / (1 - gm_cdf(age, a, b, lambda))
 )

 ```

 So the rate of occurrence of the event at duration $t$ equals the density of events at $t$ divided by the probability of surviving to that duration without the event happening.

 Because we know that $S(t) = 1 - F(t) = \int_{t}^{\infty} f(x)dx$, we know that the derivative of $S(t)$ is $-f(t)$. So, we can rewrite the hazard function as 

 $$
 \lambda(t) = - \frac{d}{dt}\log S(t)
 $$
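
 The step from $\lambda(t) = f(t)/S(t)$ to this form uses only the derivative of $S(t)$ and the chain rule. Written out as a short sketch:

 $$
 -\frac{d}{dt}\log S(t) = -\frac{1}{S(t)}\frac{d}{dt}S(t) = \frac{f(t)}{S(t)} = \lambda(t)
 $$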

 Integrating from $0$ to $t$ with $S(0) = 1$ allows us to rewrite $S(t)$ as the probability of surviving all hazards up to $t$:

 $$
 S(t) = \exp\left( - \int_{0}^{t} \lambda(x)dx\right)
 $$

 The integral in this expression is known as the cumulative hazard, $\Lambda(t) = \int_{0}^{t} \lambda(x)dx$.
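
 The relationship between the hazard and $S(t)$ can be checked by integrating the hazard numerically and exponentiating. The sketch below uses assumed, illustrative parameter values (with a positive `lambda`) and compares the result with the closed-form Gompertz-Makeham survival function; the variable names are introduced here for illustration.

 ```{r }

 # Assumed, illustrative parameters with a positive lambda
 a <- 7.5e-05
 b <- 0.086
 lambda <- 1.8e-03

 # Gompertz-Makeham hazard
 gm_hz <- function(x, a, b, lambda)
 {
 	a * exp(b*x) + lambda
 }

 t0 <- 60

 # Cumulative hazard: integral of the hazard from 0 to t0
 cum_hz <- integrate(gm_hz, lower = 0, upper = t0, a = a, b = b, lambda = lambda)$value

 # Survival recovered from the cumulative hazard
 surv_from_hz <- exp(-cum_hz)

 # Closed-form survival, 1 - F(t0)
 surv_closed <- exp(-lambda * t0 - a / b * (exp(b * t0) - 1))
 c(from_hazard = surv_from_hz, closed_form = surv_closed)

 ```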

 ## Explanatory variables

 Introducing explanatory variables is understood in the context of $T_i$ being the time to event for individual $i$, and modelled as

 $$
 \log(T_i) = \mathbf{{x_i}'\beta} + \epsilon_i
 $$

 where $\epsilon_i$ is a suitable error term, or equivalently the baseline value of $\log(T_i)$ when there are no explanatory variables, i.e. the explanatory variables shift the baseline value. Exponentiating gives

 $$
 T_i = \exp(\mathbf{x_i'\beta})T_{0i}
 $$

 where $T_{0i}$ is the exponentiated error term. We can use $\gamma$ as shorthand for the multiplicative effect $\exp(\mathbf{x_i'\beta})$ of the covariates.

 Interpretation is then straightforward. If an explanatory variable $x$ is binary (smokers vs non-smokers) and smokers live half as long as non-smokers, then $\gamma = 0.5$, or $\beta = \log(0.5) =$ `r log(0.5)`. In terms of the survivor function, the model is interpreted as *life acceleration*. If $S_1(t)$ is the survivor function for smokers and $S_0(t)$ that for non-smokers, then

 $$
 S_1(t) = S_0(t/\gamma)
 $$
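
 A small simulation can illustrate this relationship. The sketch below is not part of the source notes: the Weibull baseline, its shape and scale, the sample size and the names `gamma_x`, `T0` and `T1` are all assumptions made for illustration. Lifetimes in the exposed group are the baseline lifetimes multiplied by $\gamma = 0.5$, and the empirical survival at $t$ is compared with the baseline survival at $t/\gamma$.

 ```{r }

 set.seed(1)
 n <- 1e5

 # Multiplicative effect of the covariate (assumed: lifetimes are halved)
 gamma_x <- 0.5
 beta_x <- log(gamma_x)

 # Hypothetical baseline lifetimes (non-smokers), Weibull distributed
 T0 <- rweibull(n, shape = 2, scale = 70)

 # Accelerated lifetimes (smokers): T1 = gamma * T0
 T1 <- gamma_x * T0

 t0 <- 30
 # Empirical S_1(t0)
 s1_empirical <- mean(T1 > t0)
 # Baseline survival evaluated at t0 / gamma, i.e. S_0(t0 / gamma)
 s0_rescaled <- pweibull(t0 / gamma_x, shape = 2, scale = 70, lower.tail = FALSE)
 c(S1_empirical = s1_empirical, S0_rescaled = s0_rescaled)

 ```

 The two values should agree up to sampling error.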

 An alternative approach is proportional hazards [@Cox1972], which focuses on the hazard function directly.

 ## References