Gibbs sampling
==============

So far, we have demonstrated MCMC for a single parameter. What if we seek the posterior distribution of multiple parameters, and that posterior distribution does not have a standard form? One option is to perform Metropolis-Hastings (M-H) by sampling candidates for all parameters at once, and accepting or rejecting all of those candidates together. While this is possible, it can get complicated. Another (simpler) option is to sample the parameters one at a time.

As a simple example, suppose we have a joint posterior distribution for two parameters $\theta$ and $\phi$, written $p(\theta, \phi \mid y) \propto g(\theta, \phi)$. If we knew the value of $\phi$, we would just draw a candidate for $\theta$ and use $g(\theta, \phi)$ to compute our Metropolis-Hastings ratio, and possibly accept the candidate. Before moving on to the next iteration, if we don't know $\phi$, we can perform a similar update for it: draw a candidate for $\phi$ using some proposal distribution and again use $g(\theta, \phi)$ to compute the Metropolis-Hastings ratio, this time pretending we know the value of $\theta$ by substituting its current value from the Markov chain. Once we've drawn both $\theta$ and $\phi$, that completes one iteration, and we begin the next iteration by drawing a new $\theta$. In other words, we are just going back and forth, updating the parameters one at a time and plugging the current value of the other parameter into $g(\theta, \phi)$.

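For concreteness, here is the acceptance ratio for the $\theta$ update with $\phi$ held at its current value. The proposal density $q$ is notation introduced here for illustration; it is not named in the original text:

$$
\alpha = \frac{g(\theta^*, \phi) \, q(\theta_{i-1} \mid \theta^*)}{g(\theta_{i-1}, \phi) \, q(\theta^* \mid \theta_{i-1})} \, ,
$$

where $\theta^*$ is the candidate and $\theta_{i-1}$ is the current value of $\theta$ in the chain.
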
This idea of one-at-a-time updates is used in what we call *Gibbs sampling*, which also produces a stationary Markov chain (whose stationary distribution is the posterior). If you recall, this is the namesake of `JAGS`, "just another Gibbs sampler."

### Full conditional distributions

Before describing the full Gibbs sampling algorithm, there's one more thing we can do. Using the chain rule of probability, we have $p(\theta, \phi \mid y) = p(\theta \mid \phi, y) \cdot p(\phi \mid y)$. Notice that the only difference between $p(\theta, \phi \mid y)$ and $p(\theta \mid \phi, y)$ is multiplication by a factor that doesn't involve $\theta$. Since the $g(\theta, \phi)$ function above, when viewed as a function of $\theta$, is proportional to both of these expressions, we might as well have replaced it with $p(\theta \mid \phi, y)$ in our update for $\theta$.

This distribution $p(\theta \mid \phi, y)$ is called the *full conditional distribution* for $\theta$. Why use it instead of $g(\theta, \phi)$? In some cases, the full conditional distribution is a standard distribution that we know how to sample. If that happens, we no longer need to draw a candidate and decide whether to accept it. In fact, if we treat the full conditional distribution as a candidate proposal distribution, the resulting Metropolis-Hastings acceptance probability becomes exactly 1.

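A quick check of this fact: take the proposal for the candidate $\theta^*$ to be the full conditional itself, $q(\theta^* \mid \theta_{i-1}) = p(\theta^* \mid \phi, y)$, which does not depend on the current value. Because $g(\theta, \phi)$ is proportional to $p(\theta \mid \phi, y)$ as a function of $\theta$, the acceptance ratio becomes

$$
\alpha = \frac{p(\theta^* \mid \phi, y) \, p(\theta_{i-1} \mid \phi, y)}{p(\theta_{i-1} \mid \phi, y) \, p(\theta^* \mid \phi, y)} = 1 \, ,
$$

so every candidate drawn from the full conditional is accepted.
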
Gibbs samplers require a little more work up front because you need to find the full conditional distribution for each parameter. The good news is that all full conditional distributions have the same starting point: the full joint posterior distribution. Using the example above, we have

$$
p(\theta \mid \phi, y) \propto p(\theta, \phi \mid y) \, ,
$$

where we simply now treat $\phi$ as a known number. Likewise, the other full conditional is $p(\phi \mid \theta, y) \propto p(\theta, \phi \mid y)$, where here we consider $\theta$ to be a known number. We always start with the full posterior distribution. Thus, the process of finding full conditional distributions is the same as finding the posterior distribution of each parameter, pretending that all other parameters are known.

### Gibbs sampler

The idea of Gibbs sampling is that we can update multiple parameters by sampling just one parameter at a time, cycling through all parameters and repeating. To perform the update for one particular parameter, we substitute in the current values of all other parameters.

Here is the algorithm. Suppose we have a joint posterior distribution for two parameters $\theta$ and $\phi$, written $p(\theta, \phi \mid y)$. If we can find the full conditional distribution of each parameter, i.e., $p(\theta \mid \phi, y)$ and $p(\phi \mid \theta, y)$, then we can take turns sampling these distributions like so:

1. Using $\phi_{i-1}$, draw $\theta_i$ from $p(\theta \mid \phi = \phi_{i-1}, y)$.
2. Using $\theta_i$, draw $\phi_i$ from $p(\phi \mid \theta = \theta_i, y)$.

Together, steps 1 and 2 complete one cycle of the Gibbs sampler and produce the draw $(\theta_i, \phi_i)$ in one iteration of an MCMC sampler. If there are more than two parameters, we can handle that also: one Gibbs cycle would include an update for each of the parameters.

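For example, with three parameters $\theta$, $\phi$, and $\psi$ (a hypothetical third parameter introduced here only for illustration), iteration $i$ would consist of

$$
\theta_i \sim p(\theta \mid \phi_{i-1}, \psi_{i-1}, y) \, , \qquad
\phi_i \sim p(\phi \mid \theta_i, \psi_{i-1}, y) \, , \qquad
\psi_i \sim p(\psi \mid \theta_i, \phi_i, y) \, ,
$$

always conditioning on the most recent draw of every other parameter.
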
In the following segments, we will provide a concrete example of finding full conditional distributions and constructing a Gibbs sampler.

### Normal likelihood, unknown mean and variance

Let's return to the example where we have a normal likelihood with unknown mean and unknown variance. The model is

$$
\begin{align}
y_i \mid \mu, \sigma^2 &\overset{\text{iid}}{\sim} \text{N}(\mu, \sigma^2), \quad i = 1, \ldots, n \\
\mu &\sim \text{N}(\mu_0, \sigma_0^2) \\
\sigma^2 &\sim \text{IG}(\nu_0, \beta_0) \, .
\end{align}
$$

We chose a normal prior for $\mu$ because, in the case where $\sigma^2$ is known, the normal is the conjugate prior for $\mu$. Likewise, in the case where $\mu$ is known, the inverse-gamma is the conjugate prior for $\sigma^2$. This will give us convenient full conditional distributions in a Gibbs sampler.

Let's first work out the form of the full posterior distribution. When we begin analyzing data, the `JAGS` software will complete this step for us. However, it is extremely valuable to see and understand how this works.

$$
\begin{align}
p(\mu, \sigma^2 \mid y_1, y_2, \ldots, y_n) &\propto p(y_1, y_2, \ldots, y_n \mid \mu, \sigma^2) \, p(\mu) \, p(\sigma^2) \\
&= \prod_{i=1}^n \text{N}(y_i \mid \mu, \sigma^2) \times \text{N}(\mu \mid \mu_0, \sigma_0^2) \times \text{IG}(\sigma^2 \mid \nu_0, \beta_0) \\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(y_i - \mu)^2}{2\sigma^2} \right] \times \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] \times \frac{\beta_0^{\nu_0}}{\Gamma(\nu_0)} (\sigma^2)^{-(\nu_0 + 1)} \exp\left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \\
&\propto (\sigma^2)^{-n/2} \exp\left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] \exp\left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] (\sigma^2)^{-(\nu_0 + 1)} \exp\left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2)
\end{align}
$$

From here, it is easy to continue on to find the two full conditional distributions we need. First let's look at $\mu$, assuming $\sigma^2$ is known (in which case it becomes a constant and is absorbed into the normalizing constant):

$$
\begin{align}
p(\mu \mid \sigma^2, y_1, \ldots, y_n) &\propto p(\mu, \sigma^2 \mid y_1, \ldots, y_n) \\
&\propto \exp\left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] \exp\left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] \\
&= \exp\left[ -\frac{1}{2} \left( \frac{\sum_{i=1}^n (y_i - \mu)^2}{\sigma^2} + \frac{(\mu - \mu_0)^2}{\sigma_0^2} \right) \right] \\
&\propto \text{N}\left( \mu \,\mid\, \frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}, \; \frac{1}{n/\sigma^2 + 1/\sigma_0^2} \right) \, ,
\end{align}
$$

which we derived in the supplementary material of the last course. So, given the data and $\sigma^2$, $\mu$ follows this normal distribution.

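The last step is the standard completing-the-square argument. Expanding the quadratic in $\mu$ gives

$$
\frac{\sum_{i=1}^n (y_i - \mu)^2}{\sigma^2} + \frac{(\mu - \mu_0)^2}{\sigma_0^2}
= \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2
- 2 \left( \frac{n\bar{y}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \mu + c \, ,
$$

where $c$ collects terms that do not involve $\mu$. A quadratic of this form corresponds to a normal density for $\mu$ with precision $n/\sigma^2 + 1/\sigma_0^2$ and mean $(n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2)/(n/\sigma^2 + 1/\sigma_0^2)$, which is exactly the full conditional stated above.
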
Now let's look at $\sigma^2$, assuming $\mu$ is known:

$$
\begin{align}
p(\sigma^2 \mid \mu, y_1, \ldots, y_n) &\propto p(\mu, \sigma^2 \mid y_1, \ldots, y_n) \\
&\propto (\sigma^2)^{-n/2} \exp\left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] (\sigma^2)^{-(\nu_0 + 1)} \exp\left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \\
&= (\sigma^2)^{-(\nu_0 + n/2 + 1)} \exp\left[ -\frac{1}{\sigma^2} \left( \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) \right] I_{\sigma^2 > 0}(\sigma^2) \\
&\propto \text{IG}\left( \sigma^2 \,\mid\, \nu_0 + \frac{n}{2}, \; \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) \, .
\end{align}
$$

These two distributions provide the basis of a Gibbs sampler to simulate from a Markov chain whose stationary distribution is the full posterior of both $\mu$ and $\sigma^2$. We simply alternate draws between these two parameters, using the most recent draw of one parameter to update the other.

We will do this in `R` in the next segment.

### Gibbs sampler in `R`

To implement the Gibbs sampler we just described, let's return to our running example where the data are the percent change in total personnel from last year to this year for $n = 10$ companies. We'll still use a normal likelihood, but now we'll relax the assumption that we know the variance of growth between companies, $\sigma^2$, and estimate that variance. Instead of the $t$ prior from earlier, we will use the conditionally conjugate priors, normal for $\mu$ and inverse-gamma for $\sigma^2$.

The first step will be to write functions to simulate from the full conditional distributions we derived in the previous segment. The full conditional for $\mu$, given $\sigma^2$ and data, is

$$
\text{N}\left( \mu \,\mid\, \frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}, \; \frac{1}{n/\sigma^2 + 1/\sigma_0^2} \right) .
$$

``` {.r}
update_mu = function(n, ybar, sig2, mu_0, sig2_0) {
  sig2_1 = 1.0 / (n / sig2 + 1.0 / sig2_0)          # full conditional variance for mu
  mu_1 = sig2_1 * (n * ybar / sig2 + mu_0 / sig2_0) # full conditional mean for mu
  rnorm(n=1, mean=mu_1, sd=sqrt(sig2_1))
}
```

The full conditional for $\sigma^2$, given $\mu$ and data, is

$$
\text{IG}\left( \sigma^2 \,\mid\, \nu_0 + \frac{n}{2}, \; \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) .
$$

``` {.r}
update_sig2 = function(n, y, mu, nu_0, beta_0) {
  nu_1 = nu_0 + n / 2.0
  sumsq = sum( (y - mu)^2 ) # vectorized
  beta_1 = beta_0 + sumsq / 2.0
  out_gamma = rgamma(n=1, shape=nu_1, rate=beta_1) # the gamma rate corresponds to the inverse-gamma scale (beta) parameter
  1.0 / out_gamma # the reciprocal of a gamma random variable is distributed inverse-gamma
}
```

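As a quick sanity check of the `1/rgamma` trick (the shape and rate values below are arbitrary, chosen only for illustration): the mean of an $\text{IG}(a, b)$ random variable is $b/(a-1)$ for $a > 1$, and Monte Carlo draws obtained by inverting gamma draws should agree with that.

``` {.r}
set.seed(1)
a = 3.0; b = 2.0
draws = 1.0 / rgamma(n=1e5, shape=a, rate=b) # inverse-gamma draws via reciprocal of gamma draws
mean(draws) # should be close to b / (a - 1) = 1.0
```
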
With functions for drawing from the full conditionals, we are ready to write a function to perform Gibbs sampling.

``` {.r}
gibbs = function(y, n_iter, init, prior) {
  ybar = mean(y)
  n = length(y)

  ## initialize
  mu_out = numeric(n_iter)
  sig2_out = numeric(n_iter)
  mu_now = init$mu

  ## Gibbs sampler
  for (i in 1:n_iter) {
    sig2_now = update_sig2(n=n, y=y, mu=mu_now, nu_0=prior$nu_0, beta_0=prior$beta_0)
    mu_now = update_mu(n=n, ybar=ybar, sig2=sig2_now, mu_0=prior$mu_0, sig2_0=prior$sig2_0)

    sig2_out[i] = sig2_now
    mu_out[i] = mu_now
  }

  cbind(mu=mu_out, sig2=sig2_out)
}
```

Now we are ready to set up the problem in `R`.

``` {.r}
y = c(1.2, 1.4, -0.5, 0.3, 0.9, 2.3, 1.0, 0.1, 1.3, 1.9)
ybar = mean(y)
n = length(y)

## prior
prior = list()
prior$mu_0 = 0.0
prior$sig2_0 = 1.0
prior$n_0 = 2.0 # prior effective sample size for sig2
prior$s2_0 = 1.0 # prior point estimate for sig2
prior$nu_0 = prior$n_0 / 2.0 # prior parameter for inverse-gamma
prior$beta_0 = prior$n_0 * prior$s2_0 / 2.0 # prior parameter for inverse-gamma

hist(y, freq=FALSE, xlim=c(-1.0, 3.0)) # histogram of the data
curve(dnorm(x=x, mean=prior$mu_0, sd=sqrt(prior$sig2_0)), lty=2, add=TRUE) # prior for mu
points(y, rep(0,n), pch=1) # individual data points
points(ybar, 0, pch=19) # sample mean
```

{width="672"} | |
Finally, we can initialize and run the sampler!

``` {.r}
set.seed(53)

init = list()
init$mu = 0.0

post = gibbs(y=y, n_iter=1e3, init=init, prior=prior)
```

``` {.r}
head(post)
```

    ##             mu      sig2
    ## [1,] 0.3746992 1.5179144
    ## [2,] 0.4900277 0.8532821
    ## [3,] 0.2536817 1.4325174
    ## [4,] 1.1378504 1.2337821
    ## [5,] 1.0016641 0.8409815
    ## [6,] 1.1576873 0.7926196

``` {.r}
library("coda")
plot(as.mcmc(post))
```

{width="672"} | |
``` {.r}
summary(as.mcmc(post))
```

    ## 
    ## Iterations = 1:1000
    ## Thinning interval = 1 
    ## Number of chains = 1 
    ## Sample size per chain = 1000 
    ## 
    ## 1. Empirical mean and standard deviation for each variable,
    ##    plus standard error of the mean:
    ## 
    ##        Mean     SD Naive SE Time-series SE
    ## mu   0.9051 0.2868  0.00907        0.00907
    ## sig2 0.9282 0.5177  0.01637        0.01810
    ## 
    ## 2. Quantiles for each variable:
    ## 
    ##        2.5%    25%    50%   75% 97.5%
    ## mu   0.3024 0.7244 0.9089 1.090 1.481
    ## sig2 0.3577 0.6084 0.8188 1.094 2.141

As with the Metropolis-Hastings example, these chains appear to have converged.

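To back up that visual impression, `coda` provides standard numerical diagnostics; a minimal sketch using the same `post` object might look like this:

``` {.r}
library("coda")
post_mcmc = as.mcmc(post)

autocorr.diag(post_mcmc)  # autocorrelation at a few lags for each parameter
effectiveSize(post_mcmc)  # effective sample size out of the 1,000 draws
```
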