simon-anders · October 23, 2024 18:04
diff --git a/sc_density_stan.qmd b/sc_density_stan.qmd
 ---
 format:
  html:
    embed-resources: true
 ---    

 # Density estimation for single-cell gene expression with MCMC

 Simon Anders, 2024-10-23

 ## Setting

 In single-cell transcriptomics, we typically have a matrix $k_{ij}$ of count data, where $k_{ij}$ is the number of sequencing reads (i.e., mRNA transcript molecules) from gene $j$ that we have detected among all the sequencing reads originating from cell $i$. We put these into relation to the total count $s_i = \sum_j k_{ij}$ of reads from cell $j$:
 $\quad$ $k_{ij}/s_i$ estimates which fraction of the mRNA molecules in cell $i$ originate from gene $j$.

 We assume that each cell $j$ has some abstract "state" that implies a vector $(\lambda_{i1},...,\lambda_{im})$ of expected fractions (with $\sum_j\lambda_{ij}=1$), and that the actual abundance of mRNA molecules in the cell is a multinomially-distributed variable with this proportions vector. When sequencing, we detect each molecule only with a certain probability, but if we assume this probability to be the same for all genes, we may conclude that the distribution of the vector $(k_{i1},\dots,k_{im})$ is drawn from a multinomial distribution with total $s_i$ and proportions vector $(\lambda_{i1},...,\lambda_{im})$, if we condition on $s_i$.

 If we consider only a single gene $j$, it is helpful to consider $s_i$ as a known constant (i.e., as a variable we always implicitely condition on) and to ignore the counts for the other genes, so that we do not have to ensure that the counts sum to $s_i$. Then, we may simply write
 $$ k_i | \lambda_i\sim \text{Pois}(s_i \lambda_i), $$ {#eq-p}
 where we have dropped the index $j$ as we, from now on, will focus on only one gene.

 ## Task

 Given observed counts $k_i$ for some gene and observed totals $s_i$, for $n$ cells indexed as $i=1,\dots,n$, estimate the distribution of the $\lambda_i$ in @eq-p.

 ## Example data

 Let us consider $n=1000$ cells.

 ```{r}
 n <- 1000
 ```

 To get example values for the totals $s_i$ that somewhat mimic real data, we draw from a log normal with mean around 2500.

 ```{r}
 set.seed( 13245768 )

 s <- round( 10^rnorm( n, log10(2500), .5 ) )

 hist( log10(s), 30 )
 ```

 We assume that the cells come from two distinct populations and that the expression of our gene is much higher in one population than in the other. We draw the true values $\lambda_i$ for our example from a mixture of two log-normals.

 ```{r}
 truelambda <- 10^ifelse( runif(n)<.7, rnorm( n, -3.3, .4 ), rnorm( n, -1.7, .2 ) )

 hist( log10(truelambda), 30 )
 ```

 Now we can draw the counts according to @eq-p.

 ```{r}
 k <- rpois( n, truelambda*s )
 ```

 Here is a plot of the estimated fractions $k_i/s_i$, with a logarithmic x axis and a log1p-scaled y axis:

 ```{r}
 plot( s, log( k/s + 1e-4 ), cex=.3, log="x", yaxt="n", ylab="k/s" )

 tics <- c( 0, .0001, .0003, .001, .003, .01, .03, .1 )
 axis( 2, log( tics + 1e-4 ), tics, las=2 )
 ```

 To be able to get a histogram of the estimated fractions, it is customary in the field to
 transform them according to $y_i = \log_{10}\left(k_i/s_i\cdot 10^4 + 1\right)$ (as we have also
 done above for the y axis). 

 ```{r}
 hist( log10( k/s*1e4 + 1 ) )
 ```

 This histogram shows three modes, giving the misleading impression that the cells are drawn
 from three population, while our actual mixture distribution used above had only two components.
 Avoiding such issues is the purpose of this work.

 ## A basis for the distributions

 In order to be able to describe possible distributions for the $\lambda_i$, we use an $r$-dimensional basis
 of Gamma distributions with means $\mu_1,\dots,\mu_r$ which increase geometrically as $\mu_{l+1}=\nu\mu_l$. Let's use $\mu_1=10^{-5}$ and $\nu=2$ and have steps up to $\mu_r\approx 1$:

 ```{r}
 stepsize <- log(2)  # log(nu)
 mu <- exp( seq( log(1e-5), 0, by=stepsize ) )
 ```

 A gamma distribution has two parameters, known as shape and scale. We set the shape parameter, $\kappa$, to a constant value that is related to $\log(\nu)$, e.g., to $\kappa=2/\log(\nu)$. The scale parameter, $\theta$, then has to be set to $\theta_i = \mu_i / \kappa$ to get the desired means, because in a Gamma, $\mu=\kappa\theta$. 

 Here is a plot of these densities

 ```{r}
 shape <- 2 / stepsize  # kappa
 scale <- mu / shape    # theta

 plot( NULL, xlim=c(-6,1), ylim=c(0,3), xlab="log10(lambda)", ylab="density" )
 xg <- seq( -20, 1, length.out=1000 )

 for( i in 1:length(mu) )
  lines( xg, dgamma( 10^xg, shape=shape, scale=scale[i] )*(10^xg)*log(10) )
 ```
 Note that we had to multiply the densities by their argument in order to account for the logarithmic transformation of the x axis.

 Also note that the ratio of the spacing the gammas to their width (both taken on the log scale) is proportional to $\kappa \log\nu$, which we have set to 2 above.

 We wish to describe the bimodal distribution that we drew the $\lambda_i$ from as a mixture of these 
 basis distributions, whose pdf we write as
 $$ f_{\mathbf{\pi}}(\lambda) = \sum_{l=1}^r \pi_l f_{\text{Ga}}(\lambda|\kappa_l,\theta_l), \quad \text{with }\sum_{l=1}^r\pi_l=1,$$  {#eq-pi}
 where $f_\text{Ga}(\cdot|\kappa,\theta)$ is the pdf of a Gamma distribution with shape $\kappa$ and scale $\theta$.

 Our aim is to find a vector of mixture component weights $\pi_l$ that make the observed $k_i$ likely.

 For this, we use the following fact:

 If $K|\lambda\sim\text{Pois}(\lambda)$ and $\lambda\sim\text{Ga}(\kappa,\theta)$, then $K\sim \text{NB}(\kappa,1/\theta)$, where $\text{NB}(\kappa,1/\theta)$ is the negative binomial distribution with shape parameter $\kappa$ and inverse scale parameter $1/\theta$.

 ## A Stan model

 We load RStan

 ```{r}
 library( rstan )
 ```

 Now we write the model sketched out above as a Stan model.

 ```{stan output.var="stan_model"}
 data {
  int<lower=1> n;
  int<lower=1> ncomp;
  array[ncomp] real invscale;
  array[n] int k;
  array[n] int s;
  array[ncomp] real mu;
  
 }
 parameters {
  simplex[ncomp] spi;
 }
 model {
  vector[ncomp] log_spi = log(spi);
  for( i in 1:n ) {
     vector[ncomp] lps = log_spi;
     for( j in 1:ncomp ) {
        lps[j] += neg_binomial_lpmf( k[i] | 
                      invscale[j], invscale[j] / ( mu[j] * s[i] ) );
     }
     target += log_sum_exp(lps);
  }
 }
 ```

 We run the chains. 

 ```{r}
 fit <- sampling( stan_model, 
          data = list( n=n, ncomp=length(mu), k=k, s=s, mu=mu, invscale=1/scale ), 
          iter=500, cores=4 )
 ```

 Here is the fit result:

 ```{r}
 fit
 ```
 The posterior means for the $\pi_l$ can be extracted thus

 ```{r}
 spi <- summary(fit)$summary[1:length(mu),"mean"]
 spi
 ```

 We plot the density according to @eq-pi with these values:

 ```{r}
 xg <- seq( -6, 0, length.out=1000 )      
 plot( xg,
  sapply( 1:length(mu), function(i)
      dgamma( 10^xg, shape=mu[i]/scale[i], scale=scale[i] )*(10^xg)*log(10) ) %*% spi, 
    col="red", type="l", xlab="log10(lambda)", ylab="density")

 ```
 Here is this on top of the histograms for the true $\lambda_i$

 ```{r}
 hist( log10(truelambda), freq=FALSE, xlab="log10(lambda)", 30, main="" )

 lines( xg,
  sapply( 1:length(mu), function(i)
      dgamma( 10^xg, shape=mu[i]/scale[i], scale=scale[i] )*(10^xg)*log(10) ) %*% spi,
    col="red" )
 ```
	---
	format:
	html:
	embed-resources: true
	---

	# Density estimation for single-cell gene expression with MCMC

	Simon Anders, 2024-10-23

	## Setting

	In single-cell transcriptomics, we typically have a matrix $k_{ij}$ of count data, where $k_{ij}$ is the number of sequencing reads (i.e., mRNA transcript molecules) from gene $j$ that we have detected among all the sequencing reads originating from cell $i$. We put these into relation to the total count $s_i = \sum_j k_{ij}$ of reads from cell $j$:
	$\quad$ $k_{ij}/s_i$ estimates which fraction of the mRNA molecules in cell $i$ originate from gene $j$.

	We assume that each cell $j$ has some abstract "state" that implies a vector $(\lambda_{i1},...,\lambda_{im})$ of expected fractions (with $\sum_j\lambda_{ij}=1$), and that the actual abundance of mRNA molecules in the cell is a multinomially-distributed variable with this proportions vector. When sequencing, we detect each molecule only with a certain probability, but if we assume this probability to be the same for all genes, we may conclude that the distribution of the vector $(k_{i1},\dots,k_{im})$ is drawn from a multinomial distribution with total $s_i$ and proportions vector $(\lambda_{i1},...,\lambda_{im})$, if we condition on $s_i$.

	If we consider only a single gene $j$, it is helpful to consider $s_i$ as a known constant (i.e., as a variable we always implicitely condition on) and to ignore the counts for the other genes, so that we do not have to ensure that the counts sum to $s_i$. Then, we may simply write
	$$ k_i \| \lambda_i\sim \text{Pois}(s_i \lambda_i), $$ {#eq-p}
	where we have dropped the index $j$ as we, from now on, will focus on only one gene.

	## Task

	Given observed counts $k_i$ for some gene and observed totals $s_i$, for $n$ cells indexed as $i=1,\dots,n$, estimate the distribution of the $\lambda_i$ in @eq-p.

	## Example data

	Let us consider $n=1000$ cells.

	```{r}
	n <- 1000
	```

	To get example values for the totals $s_i$ that somewhat mimic real data, we draw from a log normal with mean around 2500.

	```{r}
	set.seed( 13245768 )

	s <- round( 10^rnorm( n, log10(2500), .5 ) )

	hist( log10(s), 30 )
	```

	We assume that the cells come from two distinct populations and that the expression of our gene is much higher in one population than in the other. We draw the true values $\lambda_i$ for our example from a mixture of two log-normals.

	```{r}
	truelambda <- 10^ifelse( runif(n)<.7, rnorm( n, -3.3, .4 ), rnorm( n, -1.7, .2 ) )

	hist( log10(truelambda), 30 )
	```

	Now we can draw the counts according to @eq-p.

	```{r}
	k <- rpois( n, truelambda*s )
	```

	Here is a plot of the estimated fractions $k_i/s_i$, with a logarithmic x axis and a log1p-scaled y axis:

	```{r}
	plot( s, log( k/s + 1e-4 ), cex=.3, log="x", yaxt="n", ylab="k/s" )

	tics <- c( 0, .0001, .0003, .001, .003, .01, .03, .1 )
	axis( 2, log( tics + 1e-4 ), tics, las=2 )
	```

	To be able to get a histogram of the estimated fractions, it is customary in the field to
	transform them according to $y_i = \log_{10}\left(k_i/s_i\cdot 10^4 + 1\right)$ (as we have also
	done above for the y axis).

	```{r}
	hist( log10( k/s*1e4 + 1 ) )
	```

	This histogram shows three modes, giving the misleading impression that the cells are drawn
	from three population, while our actual mixture distribution used above had only two components.
	Avoiding such issues is the purpose of this work.

	## A basis for the distributions

	In order to be able to describe possible distributions for the $\lambda_i$, we use an $r$-dimensional basis
	of Gamma distributions with means $\mu_1,\dots,\mu_r$ which increase geometrically as $\mu_{l+1}=\nu\mu_l$. Let's use $\mu_1=10^{-5}$ and $\nu=2$ and have steps up to $\mu_r\approx 1$:

	```{r}
	stepsize <- log(2) # log(nu)
	mu <- exp( seq( log(1e-5), 0, by=stepsize ) )
	```

	A gamma distribution has two parameters, known as shape and scale. We set the shape parameter, $\kappa$, to a constant value that is related to $\log(\nu)$, e.g., to $\kappa=2/\log(\nu)$. The scale parameter, $\theta$, then has to be set to $\theta_i = \mu_i / \kappa$ to get the desired means, because in a Gamma, $\mu=\kappa\theta$.

	Here is a plot of these densities

	```{r}
	shape <- 2 / stepsize # kappa
	scale <- mu / shape # theta

	plot( NULL, xlim=c(-6,1), ylim=c(0,3), xlab="log10(lambda)", ylab="density" )
	xg <- seq( -20, 1, length.out=1000 )

	for( i in 1:length(mu) )
	lines( xg, dgamma( 10^xg, shape=shape, scale=scale[i] )(10^xg)log(10) )
	```
	Note that we had to multiply the densities by their argument in order to account for the logarithmic transformation of the x axis.

	Also note that the ratio of the spacing the gammas to their width (both taken on the log scale) is proportional to $\kappa \log\nu$, which we have set to 2 above.

	We wish to describe the bimodal distribution that we drew the $\lambda_i$ from as a mixture of these
	basis distributions, whose pdf we write as
	$$ f_{\mathbf{\pi}}(\lambda) = \sum_{l=1}^r \pi_l f_{\text{Ga}}(\lambda\|\kappa_l,\theta_l), \quad \text{with }\sum_{l=1}^r\pi_l=1,$$ {#eq-pi}
	where $f_\text{Ga}(\cdot\|\kappa,\theta)$ is the pdf of a Gamma distribution with shape $\kappa$ and scale $\theta$.

	Our aim is to find a vector of mixture component weights $\pi_l$ that make the observed $k_i$ likely.

	For this, we use the following fact:

	If $K\|\lambda\sim\text{Pois}(\lambda)$ and $\lambda\sim\text{Ga}(\kappa,\theta)$, then $K\sim \text{NB}(\kappa,1/\theta)$, where $\text{NB}(\kappa,1/\theta)$ is the negative binomial distribution with shape parameter $\kappa$ and inverse scale parameter $1/\theta$.

	## A Stan model

	We load RStan

	```{r}
	library( rstan )
	```

	Now we write the model sketched out above as a Stan model.

	```{stan output.var="stan_model"}
	data {
	int<lower=1> n;
	int<lower=1> ncomp;
	array[ncomp] real invscale;
	array[n] int k;
	array[n] int s;
	array[ncomp] real mu;

	}
	parameters {
	simplex[ncomp] spi;
	}
	model {
	vector[ncomp] log_spi = log(spi);
	for( i in 1:n ) {
	vector[ncomp] lps = log_spi;
	for( j in 1:ncomp ) {
	lps[j] += neg_binomial_lpmf( k[i] \|
	invscale[j], invscale[j] / ( mu[j] * s[i] ) );
	}
	target += log_sum_exp(lps);
	}
	}
	```

	We run the chains.

	```{r}
	fit <- sampling( stan_model,
	data = list( n=n, ncomp=length(mu), k=k, s=s, mu=mu, invscale=1/scale ),
	iter=500, cores=4 )
	```

	Here is the fit result:

	```{r}
	fit
	```
	The posterior means for the $\pi_l$ can be extracted thus

	```{r}
	spi <- summary(fit)$summary[1:length(mu),"mean"]
	spi
	```

	We plot the density according to @eq-pi with these values:

	```{r}
	xg <- seq( -6, 0, length.out=1000 )
	plot( xg,
	sapply( 1:length(mu), function(i)
	dgamma( 10^xg, shape=mu[i]/scale[i], scale=scale[i] )(10^xg)log(10) ) %*% spi,
	col="red", type="l", xlab="log10(lambda)", ylab="density")

	```
	Here is this on top of the histograms for the true $\lambda_i$

	```{r}
	hist( log10(truelambda), freq=FALSE, xlab="log10(lambda)", 30, main="" )

	lines( xg,
	sapply( 1:length(mu), function(i)
	dgamma( 10^xg, shape=mu[i]/scale[i], scale=scale[i] )(10^xg)log(10) ) %*% spi,
	col="red" )
	```