explodecomputer · January 11, 2017 12:07
diff --git a/rg_pcs.Rmd b/rg_pcs.Rmd
 ---
 title: Simulations of estimating genetic correlations on phenotypic principal components
 author: Gibran Hemani
 date: "`r Sys.Date()`"
 output: 
  pdf_document:
    keep_tex: true
 ---

 **To reproduce this analysis, code is available here:** [ https://gist.github.com/617a64134f74b3ee8eaba704b8949284 ]()

 To build, run:

 ```r
 rmarkdown::render("rg_pcs.Rmd")
 ```

 ```{r, echo=FALSE, cache=FALSE}
 library(knitr)
 opts_chunk$set(warning=FALSE, message=FALSE, cache=TRUE, echo=TRUE)
 ```

 ## Background

 1. Calculating principal components on phenotypes will return orthogonal axes of variation, i.e. each component will be uncorrelated with each other component.

 2. The PCs are generated agnostic of sources of correlations, so any shared genetic effects that existed within the original phenotypes should be lost between the components.

 3. Non-zero genetic correlations could feasibly arise if the shared genetic effects between two components were perfectly counter balanced by shared environmental effects operating in the opposite direction. 

 These simulations attempt to just check to evaluate that (2) is sane...

 ## Simulations

 The parameters are as follows

 ```{r }
 nsnp <- 1000
 nid <- 3000
 nphen <- 100
 neff <- 100
 mean_env_correlation <- 0.1
 mean_gen_correlation <- 0.5
 ```

 Simulate SNPs for `r nsnp` SNPs and `r nid` individuals

 ```{r }
 g <- matrix(rbinom(nsnp * nid, 2, 0.5), nid, nsnp)
 ```

 Simulate the environmental correlations between the `r nphen` phenotypes, such that every phenotype has approximately the same environmental correlation `r mean_env_correlation`

 ```{r }
 library(mvtnorm)

 rcov <- matrix(mean_env_correlation, nphen, nphen)
 diag(rcov) <- 1
 residual <- rmvnorm(nid, mean=rep(0, nphen), sigma=rcov)
 ```

 Now simulate the SNP effects for each phenotype. We want the phenotypes to have a mean genetic correlation of `r mean_gen_correlation`. I will make a slightly more complicated simulation - half of the genetic correlations are `r mean_gen_correlation`, and the other half are `r -mean_gen_correlation`.

 ```{r }
 b1 <- matrix(mean_gen_correlation, nphen/2, nphen/2)
 b2 <- matrix(-mean_gen_correlation, nphen/2, nphen/2)

 reff <- rbind(cbind(b1, b2), cbind(b2, b1))
 diag(reff) <- 1
 geneff <- rmvnorm(nsnp, mean=rep(0, nphen), sigma=reff)
 ```

 Construct the genetic effects for each phenotype

 ```{r }
 gen <- matrix(0, nid, nphen)
 for(i in 1:nphen)
 {
 	gen[,i] <- scale(g %*% geneff[,i])
 }
 ```

 Now create the phenotypes from the genetic and environmental effects

 ```{r }
 phen <- gen + residual
 ```

 The average heritability will be 0.5 because the variance of the genetic and environmental factors are the same. The average phenotypic correlation for the phenotype pairs with positive genetic correlations will be $(r_g + r_e) / 2 = `r (mean_env_correlation + mean_gen_correlation) / 2`$. And for the negative: $(-r_g + r_e) / 2 = `r (mean_env_correlation + -mean_gen_correlation) / 2`$. Check that we get this:

 ```{r }
 phen_cor <- cor(phen)
 diag(phen_cor) <- NA
 hist(phen_cor)
 ```

 ## Analyse simulated data


 ### Basic phenotypes

 Perform GWAS of each phenotype

 ```{r }
 gwas <- function(x, g)
 {
 	eff <- matrix(0, ncol(x), ncol(g))
 	for(i in 1:ncol(x))
 	{
 		for(j in 1:ncol(g))
 		{
 			eff[i,j] <- cov(x[,i], g[,j]) / var(g[,j])
 		}
 	}
 	return(eff)
 }

 # Matrix of nphen x nsnp
 res_phen <- gwas(phen, g)
 ```

 The empirical genetic correlation can then be obtained. In the graph below there are `r nphen` $\times$ `r nphen` points - one for each pair of phenotypes.

 ```{r }
 gcor_phen <- cor(t(res_phen))
 diag(gcor_phen) <- NA

 gcor_simulated <- cor(gen)
 diag(gcor_simulated) <- NA

 plot(abs(gcor_phen) ~ abs(gcor_simulated), 
 	xlab="| Simulated genetic correlations |", 
 	ylab="| Empirical genetic correlations |")
 ```

 ### Phenotypic PCs

 Now do the same for the principal components of the phenotypes

 ```{r }
 library(ggplot2)

 pcs <- princomp(phen)$scores
 res_pcs <- gwas(pcs, g)

 gcor_pcs <- cor(t(res_pcs))
 diag(gcor_pcs) <- NA
 hist(gcor_pcs)
 ```

 The genetic correlations of the PCs are basically all 0. 



 ### Phenotypic independent components

 What happens if we decompose using ICA instead of PCA? ICA works by decomposing the phenotype matrix $X$ into $S$ is the matrix of independent components and $A$ is a 'mixing matrix', such that $X = SA$. In this way, $S$ can be though of latent variables and $A$ is the linear transformation to the latent variables to recreate the original phenotypes.

 You need to choose the number of latent variables for ICA. I will just use 40, but there are methods to estimate the model complexity i.e. Automatic Relevance Determination, see section 5.2 of [ http://www.robots.ox.ac.uk/~sjrob/Pubs/ica_eeg_pos.pdf ](http://www.robots.ox.ac.uk/~sjrob/Pubs/ica_eeg_pos.pdf).

 ```{r }
 library(fastICA)
 ica_result <- fastICA(phen, n.comp=40)
 ics <- ica_result$S

 res_ics <- gwas(ics, g)
 gcor_ics <- cor(t(res_ics))
 diag(gcor_ics) <- NA
 hist(gcor_ics)
 ```

 Heatmaps of the genetic correlations -

 ```{r }
 library(ggplot2)
 library(reshape2)
 a <- melt(gcor_phen)
 a$id <- "Phenotypes"
 b <- melt(gcor_pcs)
 b$id <- "PCs"
 c <- melt(gcor_simulated)
 c$id <- "Simulated"
 d <- melt(gcor_ics)
 d$id <- "ICs"
 dat <- rbind(a, b, c, d)

 ggplot(dat, aes(x=Var1, y=Var2)) +
 geom_tile(aes(fill=value)) +
 facet_grid(. ~ id) +
 labs(x="Index of phenotypes", 
 	y="Index of phenotypes", 
 	fill="Genetic correlation") +
 scale_fill_distiller(palette = "Spectral")

 ```

 ## Summary

 Neither the ICs nor the PCs of the phenotypes have any genetic correlations. It might be worth tweaking the parameters here to see if they can elucidate something that looks a little bit less expected - i.e. under what conditions could the kind of results from the hip shape analysis be obtained.
	---
	title: Simulations of estimating genetic correlations on phenotypic principal components
	author: Gibran Hemani
	date: "`r Sys.Date()`"
	output:
	pdf_document:
	keep_tex: true
	---

	To reproduce this analysis, code is available here: [ https://gist.github.com/617a64134f74b3ee8eaba704b8949284 ]()

	To build, run:

	```r
	rmarkdown::render("rg_pcs.Rmd")
	```

	```{r, echo=FALSE, cache=FALSE}
	library(knitr)
	opts_chunk$set(warning=FALSE, message=FALSE, cache=TRUE, echo=TRUE)
	```

	## Background

	1. Calculating principal components on phenotypes will return orthogonal axes of variation, i.e. each component will be uncorrelated with each other component.

	2. The PCs are generated agnostic of sources of correlations, so any shared genetic effects that existed within the original phenotypes should be lost between the components.

	3. Non-zero genetic correlations could feasibly arise if the shared genetic effects between two components were perfectly counter balanced by shared environmental effects operating in the opposite direction.

	These simulations attempt to just check to evaluate that (2) is sane...

	## Simulations

	The parameters are as follows

	```{r }
	nsnp <- 1000
	nid <- 3000
	nphen <- 100
	neff <- 100
	mean_env_correlation <- 0.1
	mean_gen_correlation <- 0.5
	```

	Simulate SNPs for `r nsnp` SNPs and `r nid` individuals

	```{r }
	g <- matrix(rbinom(nsnp * nid, 2, 0.5), nid, nsnp)
	```

	Simulate the environmental correlations between the `r nphen` phenotypes, such that every phenotype has approximately the same environmental correlation `r mean_env_correlation`

	```{r }
	library(mvtnorm)

	rcov <- matrix(mean_env_correlation, nphen, nphen)
	diag(rcov) <- 1
	residual <- rmvnorm(nid, mean=rep(0, nphen), sigma=rcov)
	```

	Now simulate the SNP effects for each phenotype. We want the phenotypes to have a mean genetic correlation of `r mean_gen_correlation`. I will make a slightly more complicated simulation - half of the genetic correlations are `r mean_gen_correlation`, and the other half are `r -mean_gen_correlation`.

	```{r }
	b1 <- matrix(mean_gen_correlation, nphen/2, nphen/2)
	b2 <- matrix(-mean_gen_correlation, nphen/2, nphen/2)

	reff <- rbind(cbind(b1, b2), cbind(b2, b1))
	diag(reff) <- 1
	geneff <- rmvnorm(nsnp, mean=rep(0, nphen), sigma=reff)
	```

	Construct the genetic effects for each phenotype

	```{r }
	gen <- matrix(0, nid, nphen)
	for(i in 1:nphen)
	{
	gen[,i] <- scale(g %*% geneff[,i])
	}
	```

	Now create the phenotypes from the genetic and environmental effects

	```{r }
	phen <- gen + residual
	```

	The average heritability will be 0.5 because the variance of the genetic and environmental factors are the same. The average phenotypic correlation for the phenotype pairs with positive genetic correlations will be $(r_g + r_e) / 2 = `r (mean_env_correlation + mean_gen_correlation) / 2`$. And for the negative: $(-r_g + r_e) / 2 = `r (mean_env_correlation + -mean_gen_correlation) / 2`$. Check that we get this:

	```{r }
	phen_cor <- cor(phen)
	diag(phen_cor) <- NA
	hist(phen_cor)
	```

	## Analyse simulated data


	### Basic phenotypes

	Perform GWAS of each phenotype

	```{r }
	gwas <- function(x, g)
	{
	eff <- matrix(0, ncol(x), ncol(g))
	for(i in 1:ncol(x))
	{
	for(j in 1:ncol(g))
	{
	eff[i,j] <- cov(x[,i], g[,j]) / var(g[,j])
	}
	}
	return(eff)
	}

	# Matrix of nphen x nsnp
	res_phen <- gwas(phen, g)
	```

	The empirical genetic correlation can then be obtained. In the graph below there are `r nphen` $\times$ `r nphen` points - one for each pair of phenotypes.

	```{r }
	gcor_phen <- cor(t(res_phen))
	diag(gcor_phen) <- NA

	gcor_simulated <- cor(gen)
	diag(gcor_simulated) <- NA

	plot(abs(gcor_phen) ~ abs(gcor_simulated),
	xlab="\| Simulated genetic correlations \|",
	ylab="\| Empirical genetic correlations \|")
	```

	### Phenotypic PCs

	Now do the same for the principal components of the phenotypes

	```{r }
	library(ggplot2)

	pcs <- princomp(phen)$scores
	res_pcs <- gwas(pcs, g)

	gcor_pcs <- cor(t(res_pcs))
	diag(gcor_pcs) <- NA
	hist(gcor_pcs)
	```

	The genetic correlations of the PCs are basically all 0.



	### Phenotypic independent components

	What happens if we decompose using ICA instead of PCA? ICA works by decomposing the phenotype matrix $X$ into $S$ is the matrix of independent components and $A$ is a 'mixing matrix', such that $X = SA$. In this way, $S$ can be though of latent variables and $A$ is the linear transformation to the latent variables to recreate the original phenotypes.

	You need to choose the number of latent variables for ICA. I will just use 40, but there are methods to estimate the model complexity i.e. Automatic Relevance Determination, see section 5.2 of [ http://www.robots.ox.ac.uk/~sjrob/Pubs/ica_eeg_pos.pdf ](http://www.robots.ox.ac.uk/~sjrob/Pubs/ica_eeg_pos.pdf).

	```{r }
	library(fastICA)
	ica_result <- fastICA(phen, n.comp=40)
	ics <- ica_result$S

	res_ics <- gwas(ics, g)
	gcor_ics <- cor(t(res_ics))
	diag(gcor_ics) <- NA
	hist(gcor_ics)
	```

	Heatmaps of the genetic correlations -

	```{r }
	library(ggplot2)
	library(reshape2)
	a <- melt(gcor_phen)
	a$id <- "Phenotypes"
	b <- melt(gcor_pcs)
	b$id <- "PCs"
	c <- melt(gcor_simulated)
	c$id <- "Simulated"
	d <- melt(gcor_ics)
	d$id <- "ICs"
	dat <- rbind(a, b, c, d)

	ggplot(dat, aes(x=Var1, y=Var2)) +
	geom_tile(aes(fill=value)) +
	facet_grid(. ~ id) +
	labs(x="Index of phenotypes",
	y="Index of phenotypes",
	fill="Genetic correlation") +
	scale_fill_distiller(palette = "Spectral")

	```

	## Summary

	Neither the ICs nor the PCs of the phenotypes have any genetic correlations. It might be worth tweaking the parameters here to see if they can elucidate something that looks a little bit less expected - i.e. under what conditions could the kind of results from the hip shape analysis be obtained.