Skip to content

Instantly share code, notes, and snippets.

@explodecomputer
Last active September 17, 2016 13:19
Show Gist options
  • Save explodecomputer/50bf3a93a883b91f5a47cb8952c746cd to your computer and use it in GitHub Desktop.
Save explodecomputer/50bf3a93a883b91f5a47cb8952c746cd to your computer and use it in GitHub Desktop.
meta cca test of yy
# Testing metaCCA estimation of phenotypic correlation
## Introduction
Phenotypic correlation between two traits is a function of the shared genetic effects and shared environmental effects. For example suppose two traits are influenced by a single SNP, and the effect size of the SNP is the same on both traits. The genetic correlation is 1. If they also share the same environmental factors, but the effect of the factor on the first trait is positive, and the effect on the second trait is negative, the environmental correlation is -1. The phenotypic correlation of these two traits would be 0 (assuming equal variance of genetic and environmental factors).
These simulations just show that the metaCCA method of estimating phenotypic correlations are actually only estimating the genetic correlations. It also compares the estimate using a single sample or two samples.
From the perspective of metaCCA, whose objective is to estimate the joint effects of a SNP on correlated outcomes, it doesn't matter that it is estimating the genetic correlation and calling it the phenotypic correlation. This is because the joint effect of a SNP on correlated phenotypes has almost no dependence on the shared environmental effects (this likely just contributes to residual variance in the model). It is largely predicated on the shared genetic effects, which is precisely what it is calculating.
## MetaCCA
The metaCCA method of estimating 'phenotypic' correlation between traits 1 and 2 is:
$$
\sigma_{YY}(1,2) = \frac{\sum (\beta_1 - \mu_1)(\beta_2 - \mu_2)} {\sqrt{\sum{(\beta_1 - \mu_1)^2}}\sqrt{\sum{(\beta_2 - \mu_2)^2}}}
$$
Note that this can basically be simplified to:
$$
\frac{cov(\beta_1, \beta_2)} {sd(\beta_1)sd(\beta_2)}
$$
This is simply the formula for the genetic correlation. It can be obtained from the following function:
```{r }
metacca_yy <- function(bhat1, bhat2)
{
sum((bhat1 - mean(bhat1)) * (bhat2 - mean(bhat2))) / (sqrt(sum((bhat1 - mean(bhat1))^2)) * sqrt(sum((bhat2 - mean(bhat2))^2)))
}
```
Indeed, this formula is identical to the calculation for correlations:
```{r }
a <- rnorm(100)
b <- rnorm(100) + a
cor(a, b)
metacca_yy(a, b)
```
## Simulations
Two questions:
1. Do non-genetic correlations influence the metaCCA method of calculating what it calls 'phenotypic' correlation, labelled $\sigma_YY$ in the paper?
2. If not, can we use these genetic correlations in a two-sample setting.
- Two traits
- Same SNPs have effects
- Effects are 50% correlated
- Simulate the phenotypes for two independent samples
Sample parameters:
```{r }
# First sample size
nidA <- 10000
# Second sample size
nidB <- 10000
# Number of genetic factors
nsnp <- 100
# Number of environmental factors
nenv <- 100
```
Simulate the SNPs and environmental factors for each sample
```{r }
e1 <- scale(matrix(rnorm(nenv * nidA), nidA, nenv))
g1 <- scale(matrix(rbinom(nsnp * nidA, 2, 0.5), nidA, nsnp))
e2 <- scale(matrix(rnorm(nenv * nidB), nidB, nenv))
g2 <- scale(matrix(rbinom(nsnp * nidB, 2, 0.5), nidB, nsnp))
```
Note that the genotypes are all scaled to have mean 0 and variance 1. Same for the environmental factors.
Specify the genetic correlation
```{r }
rg <- sqrt(0.5)
```
Specify the environmental correlation
```{r }
re <- - sqrt(0.5)
```
Generate genetic and environmental effects for each trait. The effects on each trait need to be correlated based on `rg` and `re`.
```{r }
library(mvtnorm)
eff_g <- rmvnorm(nsnp, c(0,0), sigma=matrix(c(1, rg, rg, 1), 2))
eff_e <- rmvnorm(nenv, c(0,0), sigma=matrix(c(1, re, re, 1), 2))
```
e.g. The genetic effects on trait 1 (`eff_g[,1]`) and genetic effects on trait 2 (`eff_g[,2]`) are correlated:
```{r }
cor(eff_g[,1], eff_g[,2])
cor(eff_e[,1], eff_e[,2])
```
The phenotype can now be simulated simply as
$$
y = \textbf{Gb_{g}} + \textbf{Eb_{e}}
$$
where $G$ is the scaled genotype matrix, $E$ is the scaled environmental factor matrix, and $b_{g}$ and $b_{e}$ are the corresponding effects for SNPs and environment.
We can simulate phenotypes for traits 1 and 2, in each of the two samples.
```{r }
y1_sampleA <- scale(g1 %*% eff_g[,1] + e1 %*% eff_e[,1])
y1_sampleB <- scale(g2 %*% eff_g[,1] + e2 %*% eff_e[,1])
y2_sampleA <- scale(g1 %*% eff_g[,2] + e1 %*% eff_e[,2])
y2_sampleB <- scale(g2 %*% eff_g[,2] + e2 %*% eff_e[,2])
```
The phenotypic correlations can now be calculated within each sample. Sample 1:
```{r }
cor(y1_sampleA, y2_sampleA)
```
Sample 2:
```{r }
cor(y1_sampleB, y2_sampleB)
```
We need to obtain GWAS effect sizes to be able to perform this calculation
```{r }
gwas <- function(g, y)
{
b <- array(ncol(g))
for(i in 1:ncol(g))
{
b[i] <- cov(g[,i], y) / var(g[,i])
}
return(b)
}
bhat1_sampleA <- gwas(g1, y1_sampleA)
bhat2_sampleA <- gwas(g1, y2_sampleA)
bhat1_sampleB <- gwas(g2, y1_sampleB)
bhat2_sampleB <- gwas(g2, y2_sampleB)
```
We can now estimate the metaCCA based estimate of
```{r }
metacca_yy(bhat1_sampleA, bhat2_sampleA)
metacca_yy(bhat1_sampleB, bhat2_sampleA)
metacca_yy(bhat1_sampleA, bhat2_sampleB)
metacca_yy(bhat1_sampleB, bhat2_sampleB)
```
Note that these are close to the simulated genetic correlation `r rg`, not the phenotypic correlation `cor(y1_sampleA, y2_sampleA) = `, `r cor(y1_sampleA, y2_sampleA)`. Indeed, they are identical to the simple correlation between the $\hat{\beta}$ estimates:
```{r }
cor(bhat1_sampleA, bhat2_sampleA)
cor(bhat1_sampleB, bhat2_sampleA)
cor(bhat1_sampleA, bhat2_sampleB)
cor(bhat1_sampleB, bhat2_sampleB)
```
## Further simulations
Set different parameters:
```{r }
param <- expand.grid(
nsim = 1:10,
rg = c(-0.5, 0.5),
re = c(-0.5, 0.5),
nidA = c(1000, 10000),
nidB = c(1000, 10000),
nsnp = 100,
nenv = 100
)
```
Perform simulations. We will calculate $\sigma_YY$ between traits 1 and 2 in the one-sample and two-sample cases.
```{r }
sim <- function(param, i)
{
nenv <- param$nenv[i]
nsnp <- param$nsnp[i]
nidA <- param$nidA[i]
nidB <- param$nidB[i]
rg <- param$rg[i]
re <- param$re[i]
e1 <- scale(matrix(rnorm(nenv * nidA), nidA, nenv))
g1 <- scale(matrix(rbinom(nsnp * nidA, 2, 0.5), nidA, nsnp))
e2 <- scale(matrix(rnorm(nenv * nidB), nidB, nenv))
g2 <- scale(matrix(rbinom(nsnp * nidB, 2, 0.5), nidB, nsnp))
eff_g <- rmvnorm(nsnp, c(0,0), sigma=matrix(c(1, rg, rg, 1), 2))
eff_e <- rmvnorm(nenv, c(0,0), sigma=matrix(c(1, re, re, 1), 2))
y1_sampleA <- scale(g1 %*% eff_g[,1] + e1 %*% eff_e[,1])
y1_sampleB <- scale(g2 %*% eff_g[,1] + e2 %*% eff_e[,1])
y2_sampleA <- scale(g1 %*% eff_g[,2] + e1 %*% eff_e[,2])
y2_sampleB <- scale(g2 %*% eff_g[,2] + e2 %*% eff_e[,2])
bhat1_sampleA <- gwas(g1, y1_sampleA)
bhat2_sampleA <- gwas(g1, y2_sampleA)
bhat1_sampleB <- gwas(g2, y1_sampleB)
bhat2_sampleB <- gwas(g2, y2_sampleB)
param$yyAA[i] = metacca_yy(bhat1_sampleA, bhat2_sampleA)
param$yyAB[i] = metacca_yy(bhat1_sampleA, bhat2_sampleB)
param$yyBA[i] = metacca_yy(bhat1_sampleB, bhat2_sampleA)
param$yyBB[i] = metacca_yy(bhat1_sampleB, bhat2_sampleB)
param$rA[i] = cor(y1_sampleA, y2_sampleA)
param$rB[i] = cor(y1_sampleB, y2_sampleB)
return(param)
}
for(i in 1:nrow(param))
{
# message(i)
param <- sim(param, i)
}
library(reshape2)
library(dplyr)
res_s <- param %>% dplyr::group_by(rg, re, nidA, nidB, nsnp, nenv) %>%
dplyr::summarise(
n = n(),
yyAA = mean(yyAA),
yyAB = mean(yyAB),
yyBA = mean(yyBA),
yyBB = mean(yyBB),
rA = mean(rA),
rB = mean(rB)
)
res_sl <- melt(res_s, id.vars = c("rg", "re", "nidA", "nidB", "nsnp", "nenv"), measure.vars=c("yyAA", "yyAB", "yyBA", "yyBB", "rA", "rB"))
```
- `yyAA` = Where traits 1 and 2 are measured in one sample (A)
- `yyBB` = Where traits 1 and 2 are measured in one sample (B)
- `yyAB` = Where trait 1 is measured in sample A and trait 2 is measured in sample B
- `yyBA` = Where trait 1 is measured in sample B and trait 2 is measured in sample A
- `rA` = Phenotypic correlation within sample A
- `rB` = Phenotypic correlation within sample B
The latter two scenarios represent the two sample case.
```{r }
library(ggplot2)
res_sl$lab1 = paste("rg =", res_sl$rg)
res_sl$lab2 = paste("re =", res_sl$re)
ggplot(subset(res_sl, nid1==10000 & nid2==10000), aes(x=variable, y=value)) +
geom_bar(stat="identity", position="dodge") +
facet_grid(lab1 ~ lab2) +
scale_fill_brewer(type="qual") +
theme(axis.text.x=element_text(angle=45, hjust = 1))
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment