Bayesian Inference

NYU Applied Statistics for Social Science Research

Eric Novik | Spring 2024 | Lecture 4

MCMC, Posterior inference, and Prediction

  • Metropolis-Hastings-Rosenbluth algorithm
  • R implementation and testing
  • Computational issues
  • Why does MHR work
  • Posterior predictive distribution
  • Posterior predictive simulation in Stan
  • Optimization and code breaking with MCMC

\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\P}{\mathbb{P}} \DeclareMathOperator{\V}{\mathbb{V}} \DeclareMathOperator{\L}{\mathcal{L}} \DeclareMathOperator{\I}{\text{I}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \]

High-Level Outline of MHR

  • The idea is to spend “more time” in the area of high posterior volume
  • Pick a random starting point at \(i=1\) with \(\theta^{(1)}\)
  • Propose a next possible value of \(\theta\), call it \(\theta'\) from an (easy) proposal distribution
  • Evaluate if you should accept or reject the proposal
  • If accepted, set \(\theta^{(2)} = \theta'\); otherwise stay: \(\theta^{(2)} = \theta^{(1)}\)
  • Rinse and repeat
  • If we can get an independent draw from \(\theta\), we just take it
  • That amounts to regular Monte Carlo sampling, like rnorm(1, mean = 0, sd = 1)
  • Otherwise, we need a rule for evaluating when to accept and when to reject the proposal
  • We still need a way to evaluate the density at the proposed value (it will be part of the rule)
  • The big idea: if the proposed \(\theta'\) has a higher plausibility than the current \(\theta\), accept \(\theta'\); if not, sometimes accept \(\theta'\)
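  • A minimal sketch of that accept/reject rule in R (assuming f_prop and f_curr hold the unnormalized posterior density at the proposed and current values of \(\theta\)):

accept <- function(f_prop, f_curr) {
  alpha <- min(1, f_prop / f_curr) # acceptance probability
  runif(1) < alpha                 # TRUE with probability alpha
}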

Normal-Normal with Known \(\sigma\)

  • We will start with a known posterior
  • Let \(y \sim \text{Normal}(\mu, 0.5)\)
  • Let the prior \(\mu \sim \text{Normal}(0, 2)\)
  • Let’s say we observe \(y = 5\)
  • We can immediately compute the posterior \(\mu \mid y \sim \text{Normal}(4.71, 0.49)\)
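  • For reference, this is the standard Normal-Normal conjugate update (written in terms of precisions, 1/variance) that the helper function below implements, where \(\mu_0\) and \(\tau\) are the prior mean and sd, \(\sigma\) is the known data sd, and \(\bar{y}\) is the mean of the \(n\) observations: \[ \mu \mid y \sim \text{Normal}\left( \frac{\tau^{-2}\mu_0 + n\sigma^{-2}\bar{y}}{\tau^{-2} + n\sigma^{-2}}, \ \left(\tau^{-2} + n\sigma^{-2}\right)^{-1/2} \right) \]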
normal_normal_post <- function(y, sd, prior_mu, prior_sd) {
# for the case where sd is known
  prior_prec <- 1/prior_sd^2 
  data_prec <- 1/sd^2
  n <- length(y)
  post_mu <- (prior_prec * prior_mu + data_prec * n * mean(y)) /
             (prior_prec + n * data_prec)
  post_prec <- prior_prec + n * data_prec
  post_sd <- sqrt(1/post_prec)
  return(list(post_mu = post_mu, post_sd = post_sd))
}

normal_normal_post(y = 5, sd = 0.5, prior_mu = 0, prior_sd = 2) |>
  unlist() |> round(2)
post_mu post_sd 
   4.71    0.49 

Metropolis-Hastings-Rosenbluth (MHR)

  • We are trying to generate draws: \(\left( \theta^{(1)}, \theta^{(2)}, ..., \theta^{(N)}\right)\) implied by the target density \(f\)
  • Pick the first value of \(\theta\) randomly or deterministically
  • Draw \(\theta'\) from a proposal distribution \(q(\theta'\mid\theta)\)
  • Compute the unnormalized \(f(\theta'\mid y) \propto f(y \mid \theta') f(\theta') = f_{\text{prop}}\)
  • Compute the unnormalized \(f(\theta \mid y) \propto f(y \mid \theta) f(\theta) = f_{\text{current}}\)
  • Compute \(\text{ratio}_f = \frac{f_{\text{prop}}}{f_{\text{current}}}\) and \(\text{ratio}_q = \frac{q(\theta\mid\theta')}{q(\theta'\mid\theta)}\)
  • Acceptance probability \(\alpha = \min\left\lbrace 1, \text{ratio}_f \cdot \text{ratio}_q \right\rbrace\)
  • If \(q\) is symmetric, we can drop \(\text{ratio}_q\), in which case \(\alpha = \min\left\lbrace 1, \text{ratio}_f \right\rbrace\)
  • If \(\text{ratio}_f \cdot \text{ratio}_q \geq 1\), accept \(\theta'\); otherwise flip a coin with \(\text{Pr}(X=1) = \alpha\) and accept \(\theta'\) if \(X=1\), stay at the current \(\theta\) if \(X=0\)
  • Why, for normal proposals, \(q(\theta'\mid\theta) = q(\theta\mid\theta')\): the normal density is symmetric in its argument and mean, as the following check shows

dnorm(1, mean = 0)
[1] 0.2419707
dnorm(0, mean = 1)
[1] 0.2419707
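  • When \(q\) is not symmetric, the correction \(\text{ratio}_q\) does matter; as a hypothetical illustration, a log-normal random walk \(q(\theta' \mid \theta) = \text{LogNormal}(\log\theta,\ 1)\) is asymmetric:

dlnorm(1, meanlog = log(2)) # q(1 | 2), approximately 0.31
dlnorm(2, meanlog = log(1)) # q(2 | 1), approximately 0.16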

MHR in R

metropolis_iteration_1 <- function(y, current_mean, proposal_scale, 
                                   sd, prior_mean, prior_sd) {
# assume prior ~ N(prior_mean, prior_sd) and sd is known
# proposal sampling distribution q = N(current_mean, proposal_scale)

  proposal_mean <- rnorm(n = 1, mean = current_mean,  sd = proposal_scale) # q(mu' | mu)  
  f_proposal    <- dnorm(proposal_mean, mean = prior_mean, # proposal prior, f(theta')
                         sd = prior_sd) *                  
                   dnorm(y, mean = proposal_mean, sd = sd) # proposal lik, f(y | theta')
  f_current     <- dnorm(current_mean, mean = prior_mean,  # current prior, f(theta)
                         sd = prior_sd) *        
                   dnorm(y, mean = current_mean, sd = sd)  # current lik, f(y | theta)

  ratio <- f_proposal / f_current # [f(theta') * f(y | theta')] / [f(theta) * f(y | theta)]
  alpha <- min(ratio, 1)     
  
  if (alpha > runif(1)) {         # this is just another way of flipping a coin
    next_value <- proposal_mean  
  } else {
    next_value <- current_mean
  }
  return(next_value)
}
mhr <- function(y, f, N, start, ...) {
# y: new observation
# f: function that implements one MHR iteration
# N: number of iterations
# start: initial value of the chain
# ...: additional arguments to f
  
  draws <- numeric(N)
  draws[1] <- f(y, current_mean = start, ...)
  
  for (i in 2:N) {
    draws[i] <- f(y, current_mean = draws[i - 1], ...)
  }
  
  return(draws)
}

y <- 5; N <- 5e3; start <- 3
d <- mhr(y, N, f = metropolis_iteration_1, start, proposal_scale = 2, sd = 0.5,  
         prior_mean = 0, prior_sd = 2)

Markov Chain Animation
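This slide animates the chain's path; as a static substitute, a trace plot of the draws d generated above (a minimal base-R sketch) shows the sampler moving toward and then exploring the high-probability region around \(\mu \approx 4.71\):

plot(d, type = "l", xlab = "iteration", ylab = expression(mu),
     main = "Trace of MHR draws")
abline(h = 4.71, lty = 2, col = "red") # analytic posterior mean, for reference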

Autocorrelation function (ACF)
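MHR draws are autocorrelated because the chain moves locally and stays put whenever a proposal is rejected; base R's acf() shows how quickly that correlation decays for the draws d from above:

acf(d, lag.max = 50, main = "ACF of MHR draws")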

Comparing Samples to the True Posterior


library(ggplot2) # for the comparison plot below
library(tibble)

np <- normal_normal_post(y = y, 
                         sd = 0.5, 
                         prior_mu = 0, 
                         prior_sd = 2)

theta <- seq(np$post_mu + 3*np$post_sd, 
             np$post_mu - 3*np$post_sd, 
             len = 100)

dn <- dnorm(theta, mean = np$post_mu, 
                     sd = np$post_sd)

p <- ggplot(data = tibble(d = d), aes(x = d))
p + geom_histogram(aes(y = after_stat(density)), 
                   bins = 25, alpha = 0.6) +
  geom_line(aes(theta, dn), linewidth = 0.5, 
            color = 'red', 
            data = tibble(theta, dn)) + 
  ylab("") + xlab(expression(theta)) + 
  ggtitle("MHR draws vs Normal(4.71, 0.49)")

What is Wrong with this Code?


metropolis_iteration_1 <- function(y, current_mean, proposal_scale, 
                                   sd, prior_mean, prior_sd) {
# assume prior ~ N(prior_mean, prior_sd) and sd is known
# proposal sampling distribution q = N(current_mean, proposal_scale)

  proposal_mean <- rnorm(1, current_mean,  proposal_scale) # q(mu' | mu)        
  f_proposal    <- dnorm(proposal_mean, mean = prior_mean, # proposal prior, f(theta')
                         sd = prior_sd) *                  
                   dnorm(y, mean = proposal_mean, sd = sd) # proposal lik, f(y | theta')
  f_current     <- dnorm(current_mean, mean = prior_mean,  # current prior, f(theta)
                         sd = prior_sd) *        
                   dnorm(y, mean = current_mean, sd = sd)  # current lik, f(y | theta)

  ratio <- f_proposal / f_current # [f(theta') * f(y | theta')] / [f(theta) * f(y | theta)]
  alpha <- min(ratio, 1)     
  
  if (alpha > runif(1)) {         # this is just another way of flipping a coin
    next_value <- proposal_mean  
  } else {
    next_value <- current_mean
  }
  return(next_value)
}

What if Y is a vector?

  • Recall that the likelihood of many independent observations is the product of their individual likelihoods \[ \begin{eqnarray} f(y \mid \mu) & = & \prod_{i=1}^{n}\frac{1}{\sqrt{2 \pi} \ \sigma} \exp\left( - \, \frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right) \\ f(\mu \mid y) & \propto & \text{prior} \cdot \prod_{i=1}^{n}\frac{1}{ \sigma} \exp\left( - \, \frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right) \\ \end{eqnarray} \]

  • This suggests the following change:

metropolis_iteration_2 <- function(y, current_mean, proposal_scale, sd,
                                 prior_mean, prior_sd) {

  proposal_mean <- rnorm(1, current_mean,  proposal_scale)        
  f_proposal    <- dnorm(proposal_mean, mean = prior_mean, sd = prior_sd) * 
                   prod(dnorm(y, mean = proposal_mean, sd = sd))   
  f_current     <- dnorm(current_mean, mean = prior_mean, sd = prior_sd) *        
                   prod(dnorm(y, mean = current_mean, sd = sd))

  if ((f_proposal || f_current) == 0) {
    return("Error: underflow") # on my computer double.xmin = 2.225074e-308
  }
  ratio <- f_proposal / f_current 
  alpha <- min(ratio, 1)     
  
  if (alpha > runif(1)) {         # definitely go if f_mu_prime > f_mu: alpha = 1
    next_value <- proposal_mean   # if alpha < 1, go if alpha > U(0, 1)
  } else {
    next_value <- current_mean
  }
  return(next_value)
}

Handling Underflow

set.seed(123)
y <- rnorm(300, mean = 2, sd = 0.55)
normal_normal_post(y = y, sd = 0.55, prior_mu = 0, prior_sd = 1) |> unlist() |>
  round(2)
post_mu post_sd 
   2.02    0.03 
replicate(20, metropolis_iteration_2(y = y, 
                                     current_mean = 1, 
                                     proposal_scale = 2, 
                                     sd = 0.55, 
                                     prior_mean = 0, 
                                     prior_sd = 1))
 [1] "Error: underflow" "Error: underflow" "Error: underflow" "Error: underflow"
 [5] "Error: underflow" "1.66235834591796" "Error: underflow" "2.55673000751367"
 [9] "Error: underflow" "Error: underflow" "2.51354952759192" "Error: underflow"
[13] "Error: underflow" "Error: underflow" "3.02905649937656" "Error: underflow"
[17] "Error: underflow" "2.82278258359193" "2.67485301455861" "Error: underflow"

Always Compute on the Log Scale

  • For one observation \(y\): \[ \begin{eqnarray} \log f(y \mid \mu) & \propto & \log \left( \frac{1}{ \sigma} \exp\left( - \, \frac{1}{2} \left( \frac{y - \mu}{\sigma} \right)^2 \right) \right) \\ & = & -\log(\sigma) - \frac{1}{2} \left( \frac{y - \mu}{\sigma} \right)^2 \end{eqnarray} \]
dnorm_log <- function(y, mean = 0, sd = 1) {
# dropping constant terms; not needed for MCMC
  -log(sd) - 0.5 * ((y - mean) / sd)^2
}
  • For multiple observations, you can just sum the log-likelihood
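  • Equivalently, R's dnorm() can return the log density directly via log = TRUE; it keeps the constant terms that dnorm_log drops, but those cancel in the MHR ratio, so either works. For example, evaluated at a hypothetical \(\mu = 2\):

sum(dnorm_log(y, mean = 2, sd = 0.55))         # unnormalized log-likelihood
sum(dnorm(y, mean = 2, sd = 0.55, log = TRUE)) # differs only by a constant in mu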

MHR on the Log Scale

metropolis_iteration_log <- function(y, current_mean, proposal_scale, 
                                     sd, prior_mean, prior_sd) {
  
  # draw a proposal q(proposal_mean | current_mean)
  proposal_mean  <- rnorm(1, current_mean,  proposal_scale)           
  
  # construct a proposal: f(mu') * \prod f(y_i | mu') on the log scale
  proposal_lik   <- sum(dnorm_log(y, mean = proposal_mean, sd = sd))
  proposal_prior <- dnorm_log(proposal_mean, mean = prior_mean, sd = prior_sd)
  f_proposal     <- proposal_prior + proposal_lik
  
  # construct a current: f(mu) * \prod f(y_i | mu) on the log scale
  current_lik    <- sum(dnorm_log(y, mean = current_mean, sd = sd))
  current_prior  <- dnorm_log(current_mean, mean = prior_mean, sd = prior_sd)
  f_current      <- current_prior + current_lik
  
  # ratio on the log scale = difference of the logs
  log_ratio <- f_proposal - f_current
  log_alpha <- min(log_ratio, 0)     
  
  if (log_alpha > log(runif(1))) {  
    next_value <- proposal_mean     
  } else {
    next_value <- current_mean
  }
  return(next_value)
}

Checking the Work

set.seed(123)
y <- rnorm(300, mean = 2, sd = 0.55)
normal_normal_post(y = y, sd = 0.55, prior_mu = 0, prior_sd = 1) |> unlist() |>
  round(2)
post_mu post_sd 
   2.02    0.03 
replicate(4, metropolis_iteration_2(y = y, 
                                     current_mean = 1, 
                                     proposal_scale = 2, 
                                     sd = 0.55, 
                                     prior_mean = 0, 
                                     prior_sd = 1))
[1] "Error: underflow" "Error: underflow" "Error: underflow" "Error: underflow"
replicate(4, metropolis_iteration_log(y = y, 
                                     current_mean = 1, 
                                     proposal_scale = 2, 
                                     sd = 0.55, 
                                     prior_mean = 0, 
                                     prior_sd = 1))
[1] 1.000000 1.000000 1.423961 2.379762
d <- mhr(y, N, f = metropolis_iteration_log, start, proposal_scale = 2, sd = 0.55,  prior_mean = 0, prior_sd = 1)
mean(d) |> round(2)
[1] 2.02
sd(d) |> round(2)
[1] 0.03

Why Does the Algorithm Work

  • To show why the algorithm works, we need to show:
    1. The chain has a stationary distribution and it is unique
    2. The stationary distribution is our target distribution \(f(\theta \mid y)\)
  • Condition (1) requires some theory of Markov chains, but the conditions are mild and are generally satisfied in practice. (See Chapters 11 and 12 in Blitzstein and Hwang for more)
  • We will show an outline of the proof for (2)

Why Does the Algorithm Work

  • We will only consider the case of symmetric q
  • Suppose \(\theta\) is discrete with PMF \(f(\theta \mid y)\); pick two support points \(\theta_a\) and \(\theta_b\), and assume we are at time \(t-1\)
  • Let the probability of going from \(\theta_a\) to \(\theta_b\) be \(\P(\theta^t = \theta_b \mid \theta^{t-1} = \theta_a) := p_{a b}\) and the reverse jump: \(\P(\theta^t = \theta_a \mid \theta^{t-1} = \theta_b) := p_{b a}\)
  • We want to show: \[ \begin{eqnarray} \frac{p_{a b}}{p_{b a}} &=& \frac{f(\theta_b \mid y)}{f(\theta_a \mid y)} ,\, \text{where} \\ p_{a b} &=& q(\theta_b \mid \theta_a) \cdot \min \left \lbrace 1, \frac{f(\theta_b \mid y)}{f(\theta_a \mid y)} \right \rbrace \\ p_{b a} &=& q(\theta_a \mid \theta_b) \cdot \min \left \lbrace 1, \frac{f(\theta_a \mid y)}{f(\theta_b \mid y)} \right \rbrace \end{eqnarray} \]

Why Does the Algorithm Work

  • For symmetric q: \(q(\theta_b | \theta_a) = q(\theta_a | \theta_b)\) and: \[ \begin{eqnarray} \frac{p_{a b}}{p_{b a}} &=& \frac{\min \left \lbrace 1, \frac{f(\theta_b \mid y)}{f(\theta_a \mid y)} \right \rbrace}{\min \left \lbrace 1, \frac{f(\theta_a \mid y)}{f(\theta_b \mid y)} \right \rbrace} \end{eqnarray} \]
  • Consider the case when \(f(\theta_b \mid y) > f(\theta_a \mid y)\) \[ \begin{eqnarray} \frac{p_{a b}}{p_{b a}} &=& \frac{1}{\frac{f(\theta_a \mid y)}{f(\theta_b \mid y)}} = \frac{f(\theta_b \mid y)}{f(\theta_a \mid y)} \end{eqnarray} \]
  • When \(f(\theta_a \mid y) > f(\theta_b \mid y)\) \[ \frac{p_{a b}}{p_{b a}} = \frac{\frac{f(\theta_b \mid y)}{f(\theta_a \mid y)}}{1} = \frac{f(\theta_b \mid y)}{f(\theta_a \mid y)} \]
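  • This ratio condition is exactly detailed balance, \(f(\theta_a \mid y) \, p_{a b} = f(\theta_b \mid y) \, p_{b a}\), which is what makes the target stationary: if \(\theta^{t-1} \sim f(\theta \mid y)\), then \[ \P(\theta^t = \theta_b) = \sum_{a} f(\theta_a \mid y) \, p_{a b} = f(\theta_b \mid y) \sum_{a} p_{b a} = f(\theta_b \mid y), \] where the sum runs over all states and the transition probabilities out of \(\theta_b\) (including staying put after a rejection) sum to 1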

Introducing Posterior Predictive Distribution

  • Recall the prior predictive distribution, before observing \(y\), that appears in the denominator of Bayes’s rule: \[ f(y) = \int f(y, \theta) \, d\theta = \int f(\theta) f(y \mid \theta) \, d\theta \]

  • A posterior predictive distribution, \(f(\tilde{y} | y)\) is obtained in a similar manner \[ \begin{eqnarray} f(\tilde{y} \mid y) &=& \int f(\tilde{y}, \theta \mid y) \, d\theta \\ &=& \int f(\theta \mid y) f(\tilde{y} \mid \theta, y) \, d\theta \\ &=& \int f(\theta \mid y) f(\tilde{y} \mid \theta) \, d\theta \\ \end{eqnarray} \]

  • \(f(\tilde{y} \mid \theta, y) = f(\tilde{y} \mid \theta)\) since \(y \perp\!\!\!\perp \tilde{y} \mid \theta\) (conditional independence)

  • Two sources of variability are accounted for: sampling variability in \(\tilde{y}\) weighted by posterior variability in \(\theta\)

  • Given draws from \(f(\theta \mid y)\), \(\theta^{(m)} \sim f(\theta \mid y)\), we can compute the integral in the usual way: \(f(\tilde{y} \mid y) \approx \frac{1}{M} \sum_{m = 1}^M f(\tilde{y} \mid \theta^{(m)})\)

Introducing Posterior Predictive Distribution

  • For simple models we can often evaluate \(\int f(\theta \mid y) f(\tilde{y} \mid \theta) \, d\theta\) directly
  • Here, we will take an indirect approach using our example Normal-Normal model with known variance
  • Since we already know \(f(\theta | y)\), we will sample \(\theta\) from it, and then sample \(\tilde{y}\), from \(f(\tilde{y} \mid \theta)\)
  • Recall, \(y \sim \text{Normal}(\mu, 0.5)\) with prior \(\mu \sim \text{Normal}(0, 2)\)
  • For \(y = 5\), the posterior \(\mu | y \sim \text{Normal}(4.71, 0.49)\)
ppd <- function(post_mu, post_sd) {
  mu <- rnorm(1, mean = post_mu, sd = post_sd) # draw mu from the posterior
  rnorm(1, mean = mu, sd = 0.5)                # draw y_tilde given mu
}
pd <- normal_normal_post(y = 5, sd = 0.5, prior_mu = 0, prior_sd = 2)
y <- replicate(1e5, ppd(pd$post_mu, pd$post_sd))
cat("y_tilde | y ~ Normal(", round(mean(y), 2), ",", round(sd(y), 2), ")", sep = "")
y_tilde | y ~ Normal(4.71,0.7)
  • It can be shown that the posterior predictive distribution has the same mean as the posterior distribution and a variance equal to the sum of the posterior variance and the data variance:
ppd_mu <- pd$post_mu |> round(2)
ppd_sigma <- sqrt(pd$post_sd^2 + 0.5^2) |> round(2)
cat("y_tilde | y ~ Normal(", ppd_mu, ", ", ppd_sigma, ")", sep = "")
y_tilde | y ~ Normal(4.71, 0.7)
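  • This follows from the laws of total expectation and total variance applied to \(\tilde{y} \mid y\): \[ \begin{eqnarray} \E(\tilde{y} \mid y) &=& \E\left[\E(\tilde{y} \mid \mu) \mid y\right] = \E(\mu \mid y) = 4.71 \\ \V(\tilde{y} \mid y) &=& \E\left[\V(\tilde{y} \mid \mu) \mid y\right] + \V\left[\E(\tilde{y} \mid \mu) \mid y\right] = 0.5^2 + 0.49^2 \approx 0.7^2 \end{eqnarray} \]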

Posterior Predictions in Stan

  • You can compute posterior predictive distribution in R, Stan, and rstanarm
  • Here, we will see how to do it in Stan
  • We will show an example in rstanarm in the next lecture
data {
  real y;
  real<lower=0> sigma;
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, 2);
  y ~ normal(mu, sigma);
}
generated quantities {
  real y_tilde = normal_rng(mu, sigma);
}

Posterior Predictions in Stan

library(cmdstanr)

m1 <- cmdstan_model("stan/normal_pred.stan") # compile the model
data <- list(y = 5, sigma = 0.5)
f1 <- m1$sample(       # for other options to sample, help(sample)
  data = data,         # pass data as a list; names must match the Stan data block
  seed = 123,          # to reproduce results, Stan does not rely on R's seed
  chains = 4,          # total chains, the more, the better
  parallel_chains = 4, # for multi-processor CPUs
  refresh = 0,         # number of iterations printed on the screen
  iter_warmup = 500,   # number of draws for warmup (per chain)
  iter_sampling = 2000 # number of draws for samples (per chain)
)
Running MCMC with 4 parallel chains...

Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.2 seconds.
f1$summary()
# A tibble: 3 × 10
  variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 lp__     -3.47  -3.18 0.745 0.326 -4.95 -2.94  1.00    3507.    4238.
2 mu        4.71   4.71 0.498 0.498  3.89  5.53  1.00    2899.    4051.
3 y_tilde   4.71   4.71 0.706 0.706  3.52  5.87  1.00    4572.    5962.
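  • To work with the simulated \(\tilde{y}\) directly, you can extract the draws from the fit; a minimal sketch using cmdstanr's $draws() method and the posterior package:

library(posterior)
y_tilde <- as_draws_df(f1$draws("y_tilde"))$y_tilde   # 4 chains x 2000 draws
c(mean = mean(y_tilde), sd = sd(y_tilde)) |> round(2) # should match the summary above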

Breaking Codes with MCMC