---
format:
  html:
    embed-resources: true
---
# Density estimation for single-cell gene expression with MCMC

Simon Anders, 2024-10-23

## Setting
In single-cell transcriptomics, we typically have a matrix $k_{ij}$ of count data, where $k_{ij}$ is the number of sequencing reads (i.e., mRNA transcript molecules) from gene $j$ that we have detected among all the sequencing reads originating from cell $i$. We put these into relation to the total count $s_i = \sum_j k_{ij}$ of reads from cell $i$:

$\quad$ $k_{ij}/s_i$ estimates which fraction of the mRNA molecules in cell $i$ originate from gene $j$.

We assume that each cell $i$ has some abstract "state" that implies a vector $(\lambda_{i1},\dots,\lambda_{im})$ of expected fractions (with $\sum_j\lambda_{ij}=1$), and that the actual abundance of mRNA molecules in the cell is a multinomially distributed variable with this proportions vector. When sequencing, we detect each molecule only with a certain probability, but if we assume this probability to be the same for all genes, we may conclude that, conditional on $s_i$, the vector $(k_{i1},\dots,k_{im})$ is drawn from a multinomial distribution with total $s_i$ and proportions vector $(\lambda_{i1},\dots,\lambda_{im})$.

If we consider only a single gene $j$, it is helpful to consider $s_i$ as a known constant (i.e., as a variable we always implicitly condition on) and to ignore the counts for the other genes, so that we do not have to ensure that the counts sum to $s_i$. Then, we may simply write
$$ k_i | \lambda_i\sim \text{Pois}(s_i \lambda_i), $$ {#eq-p}

where we have dropped the index $j$, as we will from now on focus on only one gene.
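As a quick sanity check of this simplification (with made-up values for the total and the fraction), we can compare the exact marginal of the multinomial for a single gene, which is binomial, with the Poisson of @eq-p:

```{r}
# Made-up values: one cell with total s = 3000, a gene with true fraction
# lambda = 0.002. The single-gene marginal is Binomial(s, lambda), which
# is very close to Pois(s*lambda) when lambda is small:
s1 <- 3000; lambda1 <- 0.002
round( rbind(
  binomial = dbinom( 0:10, size=s1, prob=lambda1 ),
  poisson  = dpois( 0:10, s1*lambda1 ) ), 4 )
```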
## Task

Given observed counts $k_i$ for some gene and observed totals $s_i$, for $n$ cells indexed as $i=1,\dots,n$, estimate the distribution of the $\lambda_i$ in @eq-p.

## Example data

Let us consider $n=1000$ cells.

```{r}
n <- 1000
```
To get example values for the totals $s_i$ that somewhat mimic real data, we draw from a log-normal with median around 2500.
```{r}
set.seed( 13245768 )
s <- round( 10^rnorm( n, log10(2500), .5 ) )
hist( log10(s), 30 )
```
We assume that the cells come from two distinct populations and that the expression of our gene is much higher in one population than in the other. We draw the true values $\lambda_i$ for our example from a mixture of two log-normals.

```{r}
truelambda <- 10^ifelse( runif(n)<.7, rnorm( n, -3.3, .4 ), rnorm( n, -1.7, .2 ) )
hist( log10(truelambda), 30 )
```

Now we can draw the counts according to @eq-p.

```{r}
k <- rpois( n, truelambda*s )
```
Here is a plot of the estimated fractions $k_i/s_i$, with a logarithmic x axis and a log1p-scaled y axis:

```{r}
plot( s, log( k/s + 1e-4 ), cex=.3, log="x", yaxt="n", ylab="k/s" )
tics <- c( 0, .0001, .0003, .001, .003, .01, .03, .1 )
axis( 2, log( tics + 1e-4 ), tics, las=2 )
```

To be able to get a histogram of the estimated fractions, it is customary in the field to transform them according to $y_i = \log_{10}\left(k_i/s_i\cdot 10^4 + 1\right)$ (as we have also done above for the y axis).

```{r}
hist( log10( k/s*1e4 + 1 ) )
```
This histogram shows three modes, giving the misleading impression that the cells are drawn from three populations, while our actual mixture distribution used above had only two components. Avoiding such issues is the purpose of this work.
## A basis for the distributions

In order to be able to describe possible distributions for the $\lambda_i$, we use an $r$-dimensional basis of Gamma distributions with means $\mu_1,\dots,\mu_r$ which increase geometrically as $\mu_{l+1}=\nu\mu_l$. Let's use $\mu_1=10^{-5}$ and $\nu=2$ and have steps up to $\mu_r\approx 1$:

```{r}
stepsize <- log(2)   # log(nu)
mu <- exp( seq( log(1e-5), 0, by=stepsize ) )
```
A gamma distribution has two parameters, known as shape and scale. We set the shape parameter, $\kappa$, to a constant value that is related to $\log(\nu)$, e.g., to $\kappa=2/\log(\nu)$. The scale parameter, $\theta$, then has to be set to $\theta_l = \mu_l / \kappa$ to get the desired means, because for a Gamma, $\mu=\kappa\theta$.
Here is a plot of these densities:

```{r}
shape <- 2 / stepsize   # kappa
scale <- mu / shape     # theta
plot( NULL, xlim=c(-6,1), ylim=c(0,3), xlab="log10(lambda)", ylab="density" )
xg <- seq( -20, 1, length.out=1000 )
for( i in 1:length(mu) )
  lines( xg, dgamma( 10^xg, shape=shape, scale=scale[i] )*(10^xg)*log(10) ) 
```

Note that we had to multiply the densities by their argument in order to account for the logarithmic transformation of the x axis.
Also note that the ratio of the spacing of the gammas to their width (both taken on the log scale) is proportional to $\kappa \log\nu$, which we have set to 2 above.
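We can check numerically that the change of variables was done correctly: each transformed curve should still integrate to one over the $\log_{10}$ axis. (The chunk redefines `stepsize`, `mu`, `shape` and `scale` with the same values as above, so that it also runs on its own.)

```{r}
# Standalone check: after multiplying by 10^x * log(10), each curve
# should still integrate to ~1 over the log10 axis.
stepsize <- log(2)
mu <- exp( seq( log(1e-5), 0, by=stepsize ) )
shape <- 2 / stepsize
scale <- mu / shape
xg <- seq( -20, 1, length.out=20001 )
integrals <- sapply( 1:length(mu), function(i)
  sum( dgamma( 10^xg, shape=shape, scale=scale[i] )*(10^xg)*log(10) ) * ( xg[2]-xg[1] ) )
round( range( integrals ), 4 )
```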
We wish to describe the bimodal distribution that we drew the $\lambda_i$ from as a mixture of these basis distributions, whose pdf we write as

$$ f_{\mathbf{\pi}}(\lambda) = \sum_{l=1}^r \pi_l f_{\text{Ga}}(\lambda|\kappa_l,\theta_l), \quad \text{with }\sum_{l=1}^r\pi_l=1,$$ {#eq-pi}

where $f_\text{Ga}(\cdot|\kappa,\theta)$ is the pdf of a Gamma distribution with shape $\kappa$ and scale $\theta$.

Our aim is to find a vector of mixture component weights $\pi_l$ that make the observed $k_i$ likely. For this, we use the following fact:

If $K|\lambda\sim\text{Pois}(\lambda)$ and $\lambda\sim\text{Ga}(\kappa,\theta)$, then $K\sim \text{NB}(\kappa,1/\theta)$, where $\text{NB}(\kappa,1/\theta)$ is the negative binomial distribution with shape parameter $\kappa$ and inverse scale parameter $1/\theta$.
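This fact is easily verified numerically (the values for $\kappa$ and $\theta$ below are arbitrary). Note that R parametrizes the negative binomial by `size` $=\kappa$ and `prob` $=1/(1+\theta)$:

```{r}
# Draw lambda from a Gamma, then counts from Pois(lambda), and compare
# the simulated pmf with the exact negative binomial pmf:
set.seed( 1 )
kappa <- 3; theta <- 0.5
lambda_sim <- rgamma( 1e5, shape=kappa, scale=theta )
k_sim <- rpois( 1e5, lambda_sim )
round( rbind(
  simulated = as.vector( table( factor( k_sim, levels=0:5 ) ) ) / length(k_sim),
  exact     = dnbinom( 0:5, size=kappa, prob=1/(1+theta) ) ), 4 )
```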
## A Stan model

We load RStan:

```{r}
library( rstan )
```

Now we write the model sketched out above as a Stan model.
```{stan output.var="stan_model"}
data {
  int<lower=1> n;
  int<lower=1> ncomp;
  array[ncomp] real invscale;
  array[n] int k;
  array[n] int s;
  array[ncomp] real mu;
}
parameters {
  simplex[ncomp] spi;
}
model {
  vector[ncomp] log_spi = log(spi);
  for( i in 1:n ) {
    vector[ncomp] lps = log_spi;
    for( j in 1:ncomp ) {
      // Under component j: k[i] ~ NB( kappa, 1/(theta_j * s_i) ), with
      // constant shape kappa = invscale[j] * mu[j], as invscale = kappa/mu
      lps[j] += neg_binomial_lpmf( k[i] |
         invscale[j] * mu[j], invscale[j] / s[i] );
    }
    target += log_sum_exp(lps);
  }
}
```
We run the chains.

```{r}
fit <- sampling( stan_model,
  data = list( n=n, ncomp=length(mu), k=k, s=s, mu=mu, invscale=1/scale ),
  iter=500, cores=4 )
```

Here is the fit result:

```{r}
fit
```

The posterior means for the $\pi_l$ can be extracted thus:

```{r}
spi <- summary(fit)$summary[1:length(mu),"mean"]
spi
```
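Beyond posterior means, we can also inspect the full posterior of the weights; for example, 95% credible intervals, obtained via `rstan::extract` from the fit object above:

```{r}
# Posterior draws of the weights: a draws x ncomp matrix
draws <- rstan::extract( fit )$spi
# 95% credible interval for each mixture weight
t( apply( draws, 2, quantile, probs=c( .025, .975 ) ) )
```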
We plot the density according to @eq-pi with these values:

```{r}
xg <- seq( -6, 0, length.out=1000 )
plot( xg,
  sapply( 1:length(mu), function(i)
    dgamma( 10^xg, shape=mu[i]/scale[i], scale=scale[i] )*(10^xg)*log(10) ) %*% spi,
  col="red", type="l", xlab="log10(lambda)", ylab="density")
```
Here is this curve on top of the histogram of the true $\lambda_i$:

```{r}
hist( log10(truelambda), freq=FALSE, xlab="log10(lambda)", 30, main="" )
lines( xg,
  sapply( 1:length(mu), function(i)
    dgamma( 10^xg, shape=mu[i]/scale[i], scale=scale[i] )*(10^xg)*log(10) ) %*% spi,
  col="red" )
```
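Since the data were simulated, we can also overlay the true density of $\log_{10}\lambda_i$, i.e., the 70/30 normal mixture used above, to judge the quality of the fit:

```{r}
hist( log10(truelambda), freq=FALSE, xlab="log10(lambda)", 30, main="" )
lines( xg,
  sapply( 1:length(mu), function(i)
    dgamma( 10^xg, shape=mu[i]/scale[i], scale=scale[i] )*(10^xg)*log(10) ) %*% spi,
  col="red" )
# true mixture density of log10(lambda), as used in the simulation above
lines( xg, .7*dnorm( xg, -3.3, .4 ) + .3*dnorm( xg, -1.7, .2 ), col="blue" )
legend( "topright", c( "estimated", "true" ), col=c( "red", "blue" ), lty=1 )
```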