The idea is to spend “more time” in the area of high posterior volume
Pick a random starting point at \(i=1\) with \(\theta^{(1)}\)
Propose the next possible value of \(\theta\), call it \(\theta'\), drawn from an (easy) proposal distribution
Evaluate if you should accept or reject the proposal
If accepted, set \(\theta^{(2)} = \theta'\); otherwise stay where you are, \(\theta^{(2)} = \theta^{(1)}\)
Rinse and repeat
If we can get an independent draw from the distribution of \(\theta\), we just take it
That amounts to regular Monte Carlo sampling, like rnorm(1, mean = 0, sd = 1)
Otherwise, we need a rule for evaluating when to accept and when to reject the proposal
We still need a way to evaluate the density at the proposed value (it will be part of the rule)
The big idea: if the proposed \(\theta'\) has a higher plausibility than the current \(\theta\), accept \(\theta'\); if not, sometimes accept \(\theta'\)
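A toy illustration of that rule, with made-up (hypothetical) density values rather than a real model:

f_current  <- 0.20                        # plausibility of the current theta (hypothetical value)
f_proposal <- 0.05                        # plausibility of the proposed theta' (hypothetical value)
alpha  <- min(1, f_proposal / f_current)  # here 0.25: the "worse" point is still accepted 25% of the time
accept <- runif(1) < alpha                # flip a coin with Pr(accept) = alpha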
Normal-Normal with Known \(\sigma\)
We will start with a known posterior
Let \(y \sim \text{Normal}(\mu, 0.5)\)
Let the prior \(\mu \sim \text{Normal}(0, 2)\)
Let’s say we observe \(y = 5\)
We can immediately compute the posterior \(\mu \mid y \sim \text{Normal}(4.71, 0.49)\)
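These numbers follow from the standard conjugate Normal-Normal update written in terms of precisions, the same computation the function below implements: \[
\begin{eqnarray}
\text{prec}_{\text{post}} & = & \frac{1}{2^2} + \frac{1}{0.5^2} = 4.25, \qquad \sigma_{\text{post}} = \sqrt{1 / 4.25} \approx 0.49 \\
\mu_{\text{post}} & = & \frac{\frac{1}{2^2} \cdot 0 + \frac{1}{0.5^2} \cdot 5}{4.25} = \frac{20}{4.25} \approx 4.71 \\
\end{eqnarray}
\]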
normal_normal_post <- function(y, sd, prior_mu, prior_sd) {
  # for the case where sd is known
  prior_prec <- 1 / prior_sd^2
  data_prec <- 1 / sd^2
  n <- length(y)
  post_mu <- (prior_prec * prior_mu + data_prec * n * mean(y)) / (prior_prec + n * data_prec)
  post_prec <- prior_prec + n * data_prec
  post_sd <- sqrt(1 / post_prec)
  return(list(post_mu = post_mu, post_sd = post_sd))
}
normal_normal_post(y = 5, sd = 0.5, prior_mu = 0, prior_sd = 2) |> unlist() |> round(2)
Compute \(\text{ratio}_f = \frac{f_{\text{prop}}}{f_{\text{current}}}\) and \(\text{ratio}_q = \frac{q(\theta\mid\theta')}{q(\theta'\mid\theta)}\)
Acceptance probability \(\alpha = \min\left\lbrace 1, \text{ratio}_f \cdot \text{ratio}_q \right\rbrace\)
If \(q\) is symmetric, we can drop \(\text{ratio}_q\), in which case \(\alpha = \min\left\lbrace 1, \text{ratio}_f \right\rbrace\)
If the ratio is \(\geq 1\), accept \(\theta'\); otherwise flip a coin with \(\text{Pr}(X=1) = \alpha\) and accept \(\theta'\) if \(X=1\), stay with the current \(\theta\) if \(X=0\)
Why, in the case of a normal proposal, is \(q(\theta'\mid\theta) = q(\theta\mid\theta')\)?
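Because the normal density depends on its two arguments only through their squared difference; writing the proposal scale as \(s\): \[
q(\theta' \mid \theta) = \frac{1}{\sqrt{2 \pi} \, s} \exp\left( - \, \frac{(\theta' - \theta)^2}{2 s^2} \right) = q(\theta \mid \theta')
\]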
metropolis_iteration_1 <- function(y, current_mean, proposal_scale, sd, prior_mean, prior_sd) {
  # assume prior ~ N(prior_mean, prior_sd) and sd is known
  # proposal sampling distribution q = N(current_mean, proposal_scale)
  proposal_mean <- rnorm(n = 1, mean = current_mean, sd = proposal_scale)  # q(mu' | mu)
  f_proposal <- dnorm(proposal_mean, mean = prior_mean, sd = prior_sd) *   # proposal prior, f(theta')
    dnorm(y, mean = proposal_mean, sd = sd)                                # proposal lik, f(y | theta')
  f_current <- dnorm(current_mean, mean = prior_mean, sd = prior_sd) *     # current prior, f(theta)
    dnorm(y, mean = current_mean, sd = sd)                                 # current lik, f(y | theta)
  ratio <- f_proposal / f_current  # [f(theta') * f(y | theta')] / [f(theta) * f(y | theta)]
  alpha <- min(ratio, 1)
  if (alpha > runif(1)) {          # this is just another way of flipping a coin
    next_value <- proposal_mean
  } else {
    next_value <- current_mean
  }
  return(next_value)
}
mhr <- function(y, f, N, start, ...) {
  # y: new observation
  # f: function that implements one MHR iteration
  # N: number of iterations
  # start: initial value of the chain
  # ...: additional arguments to f
  draws <- numeric(N)
  draws[1] <- f(y, current_mean = start, ...)
  for (i in 2:N) {
    draws[i] <- f(y, current_mean = draws[i - 1], ...)
  }
  return(draws)
}
y <- 5; N <- 5e3; start <- 3
d <- mhr(y, N, f = metropolis_iteration_1, start,
         proposal_scale = 2, sd = 0.5, prior_mean = 0, prior_sd = 2)
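A quick numeric sanity check (output not shown here; assuming the chain has moved away from the starting value, the results should be close to the true posterior mean 4.71 and standard deviation 0.49):

mean(d) |> round(2)  # should be near 4.71
sd(d)   |> round(2)  # should be near 0.49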
Markov Chain Animation
Autocorrelation function (ACF)
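The ACF plot itself is not reproduced here; one minimal way to inspect the autocorrelation of the chain in base R is:

acf(d, lag.max = 50, main = "ACF of Metropolis draws")  # slowly decaying autocorrelation indicates slow mixing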
Comparing Samples to the True Posterior
np <- normal_normal_post(y = y, sd = 0.5, prior_mu = 0, prior_sd = 2)
theta <- seq(np$post_mu + 3 * np$post_sd, np$post_mu - 3 * np$post_sd, len = 100)
dn <- dnorm(theta, mean = np$post_mu, sd = np$post_sd)
p <- ggplot(aes(d), data = tibble(d = d))
p + geom_histogram(aes(y = after_stat(density)), bins = 25, alpha = 0.6) +
  geom_line(aes(theta, dn), linewidth = 0.5, color = 'red', data = tibble(theta, dn)) +
  ylab("") + xlab(expression(theta)) +
  ggtitle("MHR draws vs Normal(4.71, 0.49)")
What is Wrong with this Code
metropolis_iteration_1 <- function(y, current_mean, proposal_scale, sd, prior_mean, prior_sd) {
  # assume prior ~ N(prior_mean, prior_sd) and sd is known
  # proposal sampling distribution q = N(current_mean, proposal_scale)
  proposal_mean <- rnorm(1, current_mean, proposal_scale)                 # q(mu' | mu)
  f_proposal <- dnorm(proposal_mean, mean = prior_mean, sd = prior_sd) *  # proposal prior, f(theta')
    dnorm(y, mean = proposal_mean, sd = sd)                               # proposal lik, f(y | theta')
  f_current <- dnorm(current_mean, mean = prior_mean, sd = prior_sd) *    # current prior, f(theta)
    dnorm(y, mean = current_mean, sd = sd)                                # current lik, f(y | theta)
  ratio <- f_proposal / f_current  # [f(theta') * f(y | theta')] / [f(theta) * f(y | theta)]
  alpha <- min(ratio, 1)
  if (alpha > runif(1)) {          # this is just another way of flipping a coin
    next_value <- proposal_mean
  } else {
    next_value <- current_mean
  }
  return(next_value)
}
What if Y is a vector?
Recall that the likelihood of many independent observations is the product of their individual likelihoods \[
\begin{eqnarray}
f(y \mid \mu) & = & \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi} \, \sigma} \exp\left( - \, \frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right) \\
f(\mu \mid y) & \propto & \text{prior} \cdot \prod_{i=1}^{n} \frac{1}{\sigma} \exp\left( - \, \frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right) \\
\end{eqnarray}
\]
This suggests the following change:
metropolis_iteration_2 <- function(y, current_mean, proposal_scale, sd, prior_mean, prior_sd) {
  proposal_mean <- rnorm(1, current_mean, proposal_scale)
  f_proposal <- dnorm(proposal_mean, mean = prior_mean, sd = prior_sd) *
    prod(dnorm(y, mean = proposal_mean, sd = sd))
  f_current <- dnorm(current_mean, mean = prior_mean, sd = prior_sd) *
    prod(dnorm(y, mean = current_mean, sd = sd))
  if (f_proposal == 0 && f_current == 0) {
    stop("underflow: both densities are 0")  # on my computer double.xmin = 2.225074e-308
  }
  ratio <- f_proposal / f_current
  alpha <- min(ratio, 1)
  if (alpha > runif(1)) {        # definitely go if f_mu_prime > f_mu: alpha = 1
    next_value <- proposal_mean  # if alpha < 1, go if alpha > U(0, 1)
  } else {
    next_value <- current_mean
  }
  return(next_value)
}
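To see how quickly a product of many small density values can hit zero, here is a tiny illustration with made-up numbers (not taken from the model above):

prod(rep(1e-3, 110))  # 1e-330 is too small to represent as a double, so the product underflows to 0
.Machine$double.xmin  # 2.225074e-308, the smallest positive normalized double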
Handling Underflow
set.seed(123)
y <- rnorm(300, mean = 2, sd = 0.55)
normal_normal_post(y = y, sd = 0.55, prior_mu = 0, prior_sd = 1) |> unlist() |> round(2)
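metropolis_iteration_log is not defined on this slide; a minimal sketch of what it could look like, assuming it mirrors metropolis_iteration_2 but works entirely on the log scale (sums of log densities replace products of densities, which is what avoids the underflow):

metropolis_iteration_log <- function(y, current_mean, proposal_scale, sd, prior_mean, prior_sd) {
  proposal_mean <- rnorm(1, current_mean, proposal_scale)
  # log prior + log likelihood, so no product of tiny numbers is ever formed
  log_f_proposal <- dnorm(proposal_mean, mean = prior_mean, sd = prior_sd, log = TRUE) +
    sum(dnorm(y, mean = proposal_mean, sd = sd, log = TRUE))
  log_f_current <- dnorm(current_mean, mean = prior_mean, sd = prior_sd, log = TRUE) +
    sum(dnorm(y, mean = current_mean, sd = sd, log = TRUE))
  log_alpha <- min(log_f_proposal - log_f_current, 0)  # log of min(1, ratio)
  if (log(runif(1)) < log_alpha) {                     # accept with probability alpha
    return(proposal_mean)
  }
  return(current_mean)
}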
d <- mhr(y, N, f = metropolis_iteration_log, start,
         proposal_scale = 2, sd = 0.55, prior_mean = 0, prior_sd = 1)
mean(d) |> round(2)
[1] 2.02
sd(d) |>round(2)
[1] 0.03
Why Does the Algorithm Work
To show why the algorithm works, we need to show:
(a) The chain has a stationary distribution and it is unique
(b) The stationary distribution is our target distribution \(f(\theta \mid y)\)
Condition (a) requires some theory of Markov Chains, but the conditions are mild and are generally satisfied in practice. (See Chapters 11 and 12 in Blitzstein and Hwang for more)
We will show an outline of the proof for (b)
Why Does the Algorithm Work
We will only consider the case of symmetric q
Suppose you sample two points, \(\theta_a\) and \(\theta_b\), from the PMF \(f(\theta \mid y)\), and assume we are at time \(t-1\)
Let the probability of going from \(\theta_a\) to \(\theta_b\) be \(\text{Pr}(\theta^{(t)} = \theta_b \mid \theta^{(t-1)} = \theta_a) := p_{ab}\) and the reverse jump \(\text{Pr}(\theta^{(t)} = \theta_a \mid \theta^{(t-1)} = \theta_b) := p_{ba}\)
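One way to complete the outline is the standard detailed-balance argument. Suppose, without loss of generality, \(f(\theta_b \mid y) \geq f(\theta_a \mid y)\), so the jump \(\theta_a \to \theta_b\) is always accepted while the reverse jump is accepted with probability \(f(\theta_a \mid y) / f(\theta_b \mid y)\): \[
\begin{eqnarray}
\text{Pr}(\theta^{(t-1)} = \theta_a, \, \theta^{(t)} = \theta_b) & = & f(\theta_a \mid y) \, q(\theta_b \mid \theta_a) \cdot 1 \\
\text{Pr}(\theta^{(t-1)} = \theta_b, \, \theta^{(t)} = \theta_a) & = & f(\theta_b \mid y) \, q(\theta_a \mid \theta_b) \cdot \frac{f(\theta_a \mid y)}{f(\theta_b \mid y)} = f(\theta_a \mid y) \, q(\theta_a \mid \theta_b) \\
\end{eqnarray}
\]
With a symmetric \(q\) the two probabilities are equal, so the joint distribution of \((\theta^{(t-1)}, \theta^{(t)})\) is exchangeable and \(f(\theta \mid y)\) is the stationary distribution of the chain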
Two sources of variability are accounted for: sampling variability in \(\tilde{y}\) weighted by posterior variability in \(\theta\)
Given draws from the posterior, \(\theta^{(m)} \sim f(\theta \mid y)\), we can approximate the posterior predictive integral \(f(\tilde{y} \mid y) = \int f(\tilde{y} \mid \theta) \, f(\theta \mid y) \, d\theta\) in the usual Monte Carlo way: \(f(\tilde{y} \mid y) \approx \frac{1}{M} \sum_{m = 1}^M f(\tilde{y} \mid \theta^{(m)})\)
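A minimal sketch of this approximation for the running example, evaluating the predictive density at one hypothetical point \(\tilde{y} = 4\):

theta_m <- rnorm(1e4, mean = 4.71, sd = 0.49)  # draws from f(theta | y); here we can sample it directly
mean(dnorm(4, mean = theta_m, sd = 0.5))       # approximates f(y_tilde = 4 | y)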
Introducing Posterior Predictive Distribution
For simple models we can often evaluate \(\int f(\theta \mid y) f(\tilde{y} \mid \theta) \, d\theta\) directly
We will take an indirect approach, using our example Normal-Normal model with known variance
Since we already know \(f(\theta \mid y)\), we will sample \(\theta\) from it, and then sample \(\tilde{y}\) from \(f(\tilde{y} \mid \theta)\)
For \(y = 5\), the posterior \(\mu | y \sim \text{Normal}(4.71, 0.49)\)
ppd <- function(post_mu, post_sd) {
  mu <- rnorm(1, mean = post_mu, sd = post_sd)  # draw theta from the posterior
  y <- rnorm(1, mean = mu, sd = 0.5)            # draw y_tilde from f(y_tilde | theta)
  return(y)
}
pd <- normal_normal_post(y = 5, sd = 0.5, prior_mu = 0, prior_sd = 2)
y <- replicate(1e5, ppd(pd$post_mu, pd$post_sd))
cat("y_tilde | y ~ Normal(", round(mean(y), 2), ",", round(sd(y), 2), ")", sep = "")
y_tilde | y ~ Normal(4.71,0.7)
It can be shown that the posterior predictive distribution has the same mean as the posterior distribution, and a variance equal to the sum of the posterior variance and the data variance:
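\[
\tilde{y} \mid y \sim \text{Normal}\left(\mu_{\text{post}}, \, \sqrt{\sigma_{\text{post}}^2 + \sigma^2}\right), \quad \text{here } \text{Normal}\left(4.71, \, \sqrt{0.49^2 + 0.5^2}\right) \approx \text{Normal}(4.71, 0.70)
\]
which matches the simulation above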
You can compute the posterior predictive distribution in R, Stan, and rstanarm
Here, we will see how to do it in Stan
We will show an example in rstanarm in the next lecture
data {
  real y;
  real<lower=0> sigma;
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, 2);
  y ~ normal(mu, sigma);
}
generated quantities {
  real y_tilde = normal_rng(mu, sigma);
}
Posterior Predictions in Stan
library(cmdstanr)
m1 <- cmdstan_model("stan/normal_pred.stan")  # compile the model
data <- list(y = 5, sigma = 0.5)
f1 <- m1$sample(          # for other options to sample, help(sample)
  data = data,            # pass data as a list, match the vars name to Stan
  seed = 123,             # to reproduce results, Stan does not rely on R's seed
  chains = 4,             # total chains, the more, the better
  parallel_chains = 4,    # for multi-processor CPUs
  refresh = 0,            # number of iterations printed on the screen
  iter_warmup = 500,      # number of draws for warmup (per chain)
  iter_sampling = 2000    # number of draws for samples (per chain)
)
Running MCMC with 4 parallel chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.2 seconds.
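One way to inspect the fitted object (values not shown here; the mean and sd of y_tilde should be close to 4.71 and 0.70):

f1$summary(variables = c("mu", "y_tilde"))  # posterior and posterior predictive summaries
y_tilde <- f1$draws("y_tilde")              # raw draws of y_tilde, one per post-warmup iteration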