The idea is to spend “more time” in the area of high posterior volume
Pick a random starting point at \(i=1\) with \(\theta^{(1)}\)
Propose the next possible value of \(\theta\), call it \(\theta'\), drawn from an (easy) proposal distribution
Evaluate if you should accept or reject the proposal
If accepted, set \(\theta^{(2)} = \theta'\); otherwise stay where you are, \(\theta^{(2)} = \theta^{(1)}\)
Rinse and repeat
If we can get an independent draw from the distribution of \(\theta\), we just take it
That amounts to regular Monte Carlo sampling, like rnorm(1, mean = 0, sd = 1)
Otherwise, we need a rule for evaluating when to accept and when to reject the proposal
We still need a way to evaluate the density at the proposed value (it will be part of the rule)
The big idea: if the proposed \(\theta'\) has a higher plausibility than the current \(\theta\), accept \(\theta'\); if not, sometimes accept \(\theta'\)
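A toy illustration of that rule, with made-up (hypothetical) density values rather than a real model:

f_current  <- 0.20                        # plausibility of the current theta (hypothetical value)
f_proposal <- 0.05                        # plausibility of the proposed theta' (hypothetical value)
alpha  <- min(1, f_proposal / f_current)  # here 0.25: the "worse" point is still accepted 25% of the time
accept <- runif(1) < alpha                # flip a coin with Pr(accept) = alpha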
Normal-Normal with Known \(\sigma\)
We will start with a known posterior
Let \(y \sim \text{Normal}(\mu, 0.5)\)
Let the prior \(\mu \sim \text{Normal}(0, 2)\)
Let’s say we observe \(y = 5\)
We can immediately compute the posterior \(\mu \mid y \sim \text{Normal}(4.71, 0.49)\)
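These numbers follow from the standard conjugate Normal-Normal update written in terms of precisions, the same computation the function below implements: \[
\begin{eqnarray}
\text{prec}_{\text{post}} & = & \frac{1}{2^2} + \frac{1}{0.5^2} = 4.25, \qquad \sigma_{\text{post}} = \sqrt{1 / 4.25} \approx 0.49 \\
\mu_{\text{post}} & = & \frac{\frac{1}{2^2} \cdot 0 + \frac{1}{0.5^2} \cdot 5}{4.25} = \frac{20}{4.25} \approx 4.71 \\
\end{eqnarray}
\]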
normal_normal_post <- function(y, sd, prior_mu, prior_sd) {
  # for the case where sd is known
  prior_prec <- 1 / prior_sd^2
  data_prec <- 1 / sd^2
  n <- length(y)
  post_mu <- (prior_prec * prior_mu + data_prec * n * mean(y)) / (prior_prec + n * data_prec)
  post_prec <- prior_prec + n * data_prec
  post_sd <- sqrt(1 / post_prec)
  return(list(post_mu = post_mu, post_sd = post_sd))
}
normal_normal_post(y = 5, sd = 0.5, prior_mu = 0, prior_sd = 2) |> unlist() |> round(2)
Compute \(\text{ratio}_f = \frac{f_{\text{prop}}}{f_{\text{current}}}\) and \(\text{ratio}_q = \frac{q(\theta\mid\theta')}{q(\theta'\mid\theta)}\)
Acceptance probability \(\alpha = \min\left\lbrace 1, \text{ratio}_f \cdot \text{ratio}_q \right\rbrace\)
If \(q\) is symmetric, we can drop \(\text{ratio}_q\), in which case \(\alpha = \min\left\lbrace 1, \text{ratio}_f \right\rbrace\)
If the ratio is \(\geq 1\), accept \(\theta'\); otherwise flip a coin with \(\text{Pr}(X=1) = \alpha\) and accept \(\theta'\) if \(X=1\), stay with the current \(\theta\) if \(X=0\)
Why, in the case of a normal proposal, is \(q(\theta'\mid\theta) = q(\theta\mid\theta')\)?
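Because the normal density depends on its two arguments only through their squared difference; writing the proposal scale as \(s\): \[
q(\theta' \mid \theta) = \frac{1}{\sqrt{2 \pi} \, s} \exp\left( - \, \frac{(\theta' - \theta)^2}{2 s^2} \right) = q(\theta \mid \theta')
\]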
metropolis_iteration_1 <- function(y, current_mean, proposal_scale, sd, prior_mean, prior_sd) {
  # assume prior ~ N(prior_mean, prior_sd) and sd is known
  # proposal sampling distribution q = N(current_mean, proposal_scale)
  proposal_mean <- rnorm(n = 1, mean = current_mean, sd = proposal_scale)  # q(mu' | mu)
  f_proposal <- dnorm(proposal_mean, mean = prior_mean, sd = prior_sd) *   # proposal prior, f(theta')
    dnorm(y, mean = proposal_mean, sd = sd)                                # proposal lik, f(y | theta')
  f_current <- dnorm(current_mean, mean = prior_mean, sd = prior_sd) *     # current prior, f(theta)
    dnorm(y, mean = current_mean, sd = sd)                                 # current lik, f(y | theta)
  ratio <- f_proposal / f_current  # [f(theta') * f(y | theta')] / [f(theta) * f(y | theta)]
  alpha <- min(ratio, 1)
  if (alpha > runif(1)) {          # this is just another way of flipping a coin
    next_value <- proposal_mean
  } else {
    next_value <- current_mean
  }
  return(next_value)
}
mhr <- function(y, f, N, start, ...) {
  # y: new observation
  # f: function that implements one MHR iteration
  # N: number of iterations
  # start: initial value of the chain
  # ...: additional arguments to f
  draws <- numeric(N)
  draws[1] <- f(y, current_mean = start, ...)
  for (i in 2:N) {
    draws[i] <- f(y, current_mean = draws[i - 1], ...)
  }
  return(draws)
}
y <- 5; N <- 5e3; start <- 3
d <- mhr(y, N, f = metropolis_iteration_1, start,
         proposal_scale = 2, sd = 0.5, prior_mean = 0, prior_sd = 2)
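A quick numeric sanity check (output not shown here; assuming the chain has moved away from the starting value, the results should be close to the true posterior mean 4.71 and standard deviation 0.49):

mean(d) |> round(2)  # should be near 4.71
sd(d)   |> round(2)  # should be near 0.49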
Markov Chain Animation
Autocorrelation function (ACF)
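The ACF plot itself is not reproduced here; one minimal way to inspect the autocorrelation of the chain in base R is:

acf(d, lag.max = 50, main = "ACF of Metropolis draws")  # slowly decaying autocorrelation indicates slow mixing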
Comparing Samples to the True Posterior
np <- normal_normal_post(y = y, sd = 0.5, prior_mu = 0, prior_sd = 2)
theta <- seq(np$post_mu + 3 * np$post_sd, np$post_mu - 3 * np$post_sd, len = 100)
dn <- dnorm(theta, mean = np$post_mu, sd = np$post_sd)
p <- ggplot(aes(d), data = tibble(d = d))
p + geom_histogram(aes(y = after_stat(density)), bins = 25, alpha = 0.6) +
  geom_line(aes(theta, dn), linewidth = 0.5, color = 'red', data = tibble(theta, dn)) +
  ylab("") + xlab(expression(theta)) +
  ggtitle("MHR draws vs Normal(4.71, 0.49)")
What is Wrong with this Code
metropolis_iteration_1 <- function(y, current_mean, proposal_scale, sd, prior_mean, prior_sd) {
  # assume prior ~ N(prior_mean, prior_sd) and sd is known
  # proposal sampling distribution q = N(current_mean, proposal_scale)
  proposal_mean <- rnorm(1, current_mean, proposal_scale)                 # q(mu' | mu)
  f_proposal <- dnorm(proposal_mean, mean = prior_mean, sd = prior_sd) *  # proposal prior, f(theta')
    dnorm(y, mean = proposal_mean, sd = sd)                               # proposal lik, f(y | theta')
  f_current <- dnorm(current_mean, mean = prior_mean, sd = prior_sd) *    # current prior, f(theta)
    dnorm(y, mean = current_mean, sd = sd)                                # current lik, f(y | theta)
  ratio <- f_proposal / f_current  # [f(theta') * f(y | theta')] / [f(theta) * f(y | theta)]
  alpha <- min(ratio, 1)
  if (alpha > runif(1)) {          # this is just another way of flipping a coin
    next_value <- proposal_mean
  } else {
    next_value <- current_mean
  }
  return(next_value)
}
What if Y is a vector?
Recall that the likelihood of many independent observations is the product of their individual likelihoods \[
\begin{eqnarray}
f(y \mid \mu) & = & \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi} \, \sigma} \exp\left( - \, \frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right) \\
f(\mu \mid y) & \propto & \text{prior} \cdot \prod_{i=1}^{n} \frac{1}{\sigma} \exp\left( - \, \frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right) \\
\end{eqnarray}
\]
This suggests the following change:
metropolis_iteration_2 <- function(y, current_mean, proposal_scale, sd, prior_mean, prior_sd) {
  proposal_mean <- rnorm(1, current_mean, proposal_scale)
  f_proposal <- dnorm(proposal_mean, mean = prior_mean, sd = prior_sd) *
    prod(dnorm(y, mean = proposal_mean, sd = sd))
  f_current <- dnorm(current_mean, mean = prior_mean, sd = prior_sd) *
    prod(dnorm(y, mean = current_mean, sd = sd))
  if (f_proposal == 0 && f_current == 0) {
    stop("underflow: both densities are 0")  # on my computer double.xmin = 2.225074e-308
  }
  ratio <- f_proposal / f_current
  alpha <- min(ratio, 1)
  if (alpha > runif(1)) {        # definitely go if f_mu_prime > f_mu: alpha = 1
    next_value <- proposal_mean  # if alpha < 1, go if alpha > U(0, 1)
  } else {
    next_value <- current_mean
  }
  return(next_value)
}
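To see how quickly a product of many small density values can hit zero, here is a tiny illustration with made-up numbers (not taken from the model above):

prod(rep(1e-3, 110))  # 1e-330 is too small to represent as a double, so the product underflows to 0
.Machine$double.xmin  # 2.225074e-308, the smallest positive normalized double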
Handling Underflow
set.seed(123)
y <- rnorm(300, mean = 2, sd = 0.55)
normal_normal_post(y = y, sd = 0.55, prior_mu = 0, prior_sd = 1) |> unlist() |> round(2)
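metropolis_iteration_log is not defined on this slide; a minimal sketch of what it could look like, assuming it mirrors metropolis_iteration_2 but works entirely on the log scale (sums of log densities replace products of densities, which is what avoids the underflow):

metropolis_iteration_log <- function(y, current_mean, proposal_scale, sd, prior_mean, prior_sd) {
  proposal_mean <- rnorm(1, current_mean, proposal_scale)
  # log prior + log likelihood, so no product of tiny numbers is ever formed
  log_f_proposal <- dnorm(proposal_mean, mean = prior_mean, sd = prior_sd, log = TRUE) +
    sum(dnorm(y, mean = proposal_mean, sd = sd, log = TRUE))
  log_f_current <- dnorm(current_mean, mean = prior_mean, sd = prior_sd, log = TRUE) +
    sum(dnorm(y, mean = current_mean, sd = sd, log = TRUE))
  log_alpha <- min(log_f_proposal - log_f_current, 0)  # log of min(1, ratio)
  if (log(runif(1)) < log_alpha) {                     # accept with probability alpha
    return(proposal_mean)
  }
  return(current_mean)
}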
d <- mhr(y, N, f = metropolis_iteration_log, start,
         proposal_scale = 2, sd = 0.55, prior_mean = 0, prior_sd = 1)
mean(d) |> round(2)
[1] 2.02
sd(d) |>round(2)
[1] 0.03
Why Does the Algorithm Work
To show why the algorithm works, we need to show:
(a) The chain has a stationary distribution and it is unique
(b) The stationary distribution is our target distribution \(f(\theta \mid y)\)
Condition (a) requires some theory of Markov Chains, but the conditions are mild and are generally satisfied in practice. (See Chapters 11 and 12 in Blitzstein and Hwang for more)
We will show an outline of the proof for (b)
Why Does the Algorithm Work
We will only consider the case of symmetric q
Suppose you sample two points, \(\theta_a\) and \(\theta_b\), from the PMF \(f(\theta \mid y)\), and assume we are at time \(t-1\)
Let the probability of going from \(\theta_a\) to \(\theta_b\) be \(\text{Pr}(\theta^{(t)} = \theta_b \mid \theta^{(t-1)} = \theta_a) := p_{ab}\) and the reverse jump \(\text{Pr}(\theta^{(t)} = \theta_a \mid \theta^{(t-1)} = \theta_b) := p_{ba}\)
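One way to complete the outline is the standard detailed-balance argument. Suppose, without loss of generality, \(f(\theta_b \mid y) \geq f(\theta_a \mid y)\), so the jump \(\theta_a \to \theta_b\) is always accepted while the reverse jump is accepted with probability \(f(\theta_a \mid y) / f(\theta_b \mid y)\): \[
\begin{eqnarray}
\text{Pr}(\theta^{(t-1)} = \theta_a, \, \theta^{(t)} = \theta_b) & = & f(\theta_a \mid y) \, q(\theta_b \mid \theta_a) \cdot 1 \\
\text{Pr}(\theta^{(t-1)} = \theta_b, \, \theta^{(t)} = \theta_a) & = & f(\theta_b \mid y) \, q(\theta_a \mid \theta_b) \cdot \frac{f(\theta_a \mid y)}{f(\theta_b \mid y)} = f(\theta_a \mid y) \, q(\theta_a \mid \theta_b) \\
\end{eqnarray}
\]
With a symmetric \(q\) the two probabilities are equal, so the joint distribution of \((\theta^{(t-1)}, \theta^{(t)})\) is exchangeable and \(f(\theta \mid y)\) is the stationary distribution of the chain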
Two sources of variability are accounted for: sampling variability in \(\tilde{y}\) weighted by posterior variability in \(\theta\)
Given draws from the posterior, \(\theta^{(m)} \sim f(\theta \mid y)\), we can approximate the posterior predictive integral \(f(\tilde{y} \mid y) = \int f(\tilde{y} \mid \theta) \, f(\theta \mid y) \, d\theta\) in the usual Monte Carlo way: \(f(\tilde{y} \mid y) \approx \frac{1}{M} \sum_{m = 1}^M f(\tilde{y} \mid \theta^{(m)})\)
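A minimal sketch of this approximation for the running example, evaluating the predictive density at one hypothetical point \(\tilde{y} = 4\):

theta_m <- rnorm(1e4, mean = 4.71, sd = 0.49)  # draws from f(theta | y); here we can sample it directly
mean(dnorm(4, mean = theta_m, sd = 0.5))       # approximates f(y_tilde = 4 | y)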
Introducing Posterior Predictive Distribution
For simple models we can often evaluate \(\int f(\theta \mid y) f(\tilde{y} \mid \theta) \, d\theta\) directly
We will take an indirect approach, using our example Normal-Normal model with known variance
Since we already know \(f(\theta \mid y)\), we will sample \(\theta\) from it, and then sample \(\tilde{y}\) from \(f(\tilde{y} \mid \theta)\)
For \(y = 5\), the posterior \(\mu | y \sim \text{Normal}(4.71, 0.49)\)
ppd <- function(post_mu, post_sd) {
  mu <- rnorm(1, mean = post_mu, sd = post_sd)  # draw theta from the posterior
  y <- rnorm(1, mean = mu, sd = 0.5)            # draw y_tilde from f(y_tilde | theta)
  return(y)
}
pd <- normal_normal_post(y = 5, sd = 0.5, prior_mu = 0, prior_sd = 2)
y <- replicate(1e5, ppd(pd$post_mu, pd$post_sd))
cat("y_tilde | y ~ Normal(", round(mean(y), 2), ",", round(sd(y), 2), ")", sep = "")
y_tilde | y ~ Normal(4.71,0.7)
It can be shown that the posterior predictive distribution has the same mean as the posterior distribution, and a variance equal to the sum of the posterior variance and the data variance:
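\[
\tilde{y} \mid y \sim \text{Normal}\left(\mu_{\text{post}}, \, \sqrt{\sigma_{\text{post}}^2 + \sigma^2}\right), \quad \text{here } \text{Normal}\left(4.71, \, \sqrt{0.49^2 + 0.5^2}\right) \approx \text{Normal}(4.71, 0.70)
\]
which matches the simulation above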
You can compute the posterior predictive distribution in R, Stan, and rstanarm
Here, we will see how to do it in Stan
We will show an example in rstanarm in the next lecture
data {
  real y;
  real<lower=0> sigma;
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, 2);
  y ~ normal(mu, sigma);
}
generated quantities {
  real y_tilde = normal_rng(mu, sigma);
}
Posterior Predictions in Stan
library(cmdstanr)
m1 <- cmdstan_model("stan/normal_pred.stan")  # compile the model
data <- list(y = 5, sigma = 0.5)
f1 <- m1$sample(          # for other options to sample, help(sample)
  data = data,            # pass data as a list, match the vars name to Stan
  seed = 123,             # to reproduce results, Stan does not rely on R's seed
  chains = 4,             # total chains, the more, the better
  parallel_chains = 4,    # for multi-processor CPUs
  refresh = 0,            # number of iterations printed on the screen
  iter_warmup = 500,      # number of draws for warmup (per chain)
  iter_sampling = 2000    # number of draws for samples (per chain)
)
Running MCMC with 4 parallel chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.2 seconds.
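One way to inspect the fitted object (values not shown here; the mean and sd of y_tilde should be close to 4.71 and 0.70):

f1$summary(variables = c("mu", "y_tilde"))  # posterior and posterior predictive summaries
y_tilde <- f1$draws("y_tilde")              # raw draws of y_tilde, one per post-warmup iteration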