Bayesian Inference

NYU Applied Statistics for Social Science Research

Eric Novik | Spring 2024 | Lecture 5

Homework Feedback

How was Homework 4?

Bayesian linear regression and model evaluation

  • Introducing linear regression
  • Prior predictive simulations
  • Sampling from the posterior
  • Example of linear regression in Stan
  • Evaluating the quality of the draws
  • Posterior predictions
  • Cross-validation, ELPD, and LOO

\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\P}{\mathbb{P}} \DeclareMathOperator{\V}{\mathbb{V}} \DeclareMathOperator{\L}{\mathcal{L}} \DeclareMathOperator{\I}{\text{I}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \]

Motivating Example

  • We borrow this example from Richard McElreath’s Statistical Rethinking
  • The data sets provided were produced between 1969 and 2008, based on Nancy Howell’s observations of the !Kung San
  • From Wikipedia: “The ǃKung are one of the San peoples who live mostly on the western edge of the Kalahari desert, Ovamboland (northern Namibia and southern Angola), and Botswana.”

Howell Dataset

  • Data sample and summary:
 height   weight    age  male
151.765  47.82561    63     1
139.700  36.48581    63     0
136.525  31.86484    65     0
156.845  53.04191    41     1
145.415  41.27687    51     0
163.830  62.99259    35     1
     height           weight            age       
 Min.   : 53.98   Min.   : 4.252   Min.   : 0.00  
 1st Qu.:125.09   1st Qu.:22.008   1st Qu.:12.00  
 Median :148.59   Median :40.058   Median :27.00  
 Mean   :138.26   Mean   :35.611   Mean   :29.34  
 3rd Qu.:157.48   3rd Qu.:47.209   3rd Qu.:43.00  
 Max.   :179.07   Max.   :62.993   Max.   :88.00  

  • Notice the non-linearity in the height-weight relationship (a loading-and-plotting sketch follows below)
  • Thinking about why this should be can give you insight into how to model these data
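The slides do not show the data-loading step, so here is a minimal sketch. It assumes the data come from the rethinking package’s Howell1 data frame (the same data are also distributed as a CSV with the book); the plot makes the non-linearity visible.
library(rethinking)
library(dplyr)
library(ggplot2)

data(Howell1)                         # Nancy Howell's !Kung San data
d <- as_tibble(Howell1)
head(d)                               # compare with the sample shown above
summary(d)

# full dataset (children included); the curvature at small heights is the non-linearity
ggplot(d, aes(height, weight)) +
  geom_point(size = 0.5) +
  xlab("Height (cm)") + ylab("Weight (kg)")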

Howell Dataset

  • For now, we will focus on the linear subset of the data
  • We will demonstrate the non-linear model at the end
  • We will restrict our attention to adults (age > 18)
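A minimal sketch of this subsetting step, assuming the full data frame loaded above is named d:
d_all <- d                  # keep the full data for the non-linear model later
d <- d |> filter(age > 18)  # adults only
nrow(d)                     # should match the 352 observations reported in the model summary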

General Approach

  1. Assess the scope of the inferences that you will get with this model
  2. Unless you are doing causal inference, you should interpret your coefficients as comparisons (RAOS, Page 84)
  3. Set up reasonable priors and likelihood
  4. For more complex models, perform a forward simulation with fixed parameter values and try to recover them by running an inference algorithm (a minimal simulate-and-recover sketch follows this list). See this example for a standard epidemiological model.
  5. Perform a prior predictive simulation
  6. Possibly adjust your priors
  7. Fit the model to data
  8. Assess the quality of the inference and the quality of the model
  9. Improve or fix your model and go back to #3
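As a small illustration of step 4, here is a hedged simulate-and-recover sketch using a plain linear regression rather than the epidemiological model mentioned above; the fixed parameter values and the spread of x are made up for the example.
# forward simulation with fixed parameters, followed by parameter recovery
set.seed(123)
n_sim      <- 352
alpha_true <- 45                      # illustrative values, not estimates
beta_true  <- 0.6
sigma_true <- 4
x <- rnorm(n_sim, 0, 8)               # stand-in for centered heights
y <- rnorm(n_sim, alpha_true + beta_true * x, sigma_true)
coef(lm(y ~ x))                       # should land close to 45 and 0.6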

Howell Regression

  • We will build a predictive model for adult Weight \(y\) given Height \(x\) using the Howell dataset
  • Initial stab at the model: \[ \begin{eqnarray} y_i & \sim & \text{Normal}(\mu_i, \, \sigma)\\ \mu_i & = & \alpha + \beta x_i \\ \alpha & \sim & \text{Normal}(\alpha_{l}, \, \alpha_s) \\ \beta & \sim & \text{Normal}(\beta_{l}, \, \beta_s) \\ \sigma & \sim & \text{Exp}(r) \\ \end{eqnarray} \]
  • We have to specify \(\alpha_{l}\) and \(\alpha_s\) (and likewise \(\beta_{l}\) and \(\beta_s\)), where \(l\) and \(s\) signify location and scale, and \(r\), the rate of the exponential
  • If we work on the original scale for \(x\), it is awkward to choose a prior for the intercept: it corresponds to the weight of a person with zero height
  • This can be fixed by subtracting the average height from \(x\)
  • Why is \(\text{Exp}(r)\) a reasonable prior for \(\sigma\)?

Howell Regression

  • We define a new variable, the centered version of \(x\): \(x^c_i = x_i - \bar{x}\)
  • Now \(\alpha\) corresponds to the weight of an average person
  • Checking Wikipedia reveals that the average weight of a person in Africa is about 60 kg
  • They don’t state the standard deviation, but it is unlikely that an African adult would weigh less than 30 kg or more than 120 kg, so we will set the prior sd = 10 \[ \begin{eqnarray} y_i & \sim & \text{Normal}(\mu_i, \, \sigma)\\ \mu_i & = & \alpha + \beta x^c_i \\ \alpha & \sim & \text{Normal}(60, \, 10) \\ \beta & \sim & \text{Normal}(\beta_{l}, \, \beta_s) \\ \sigma & \sim & \text{Exp}(r) \\ \end{eqnarray} \]
  • What about the slope \(\beta\)?

Howell Regression

  • In this dataset, the units of \(\beta\) are \(\frac{kg}{cm}\), since the units of height are \(cm\)
  • First, \(\beta\) should be positive. Why?
  • Second, \(\beta\) is likely less than 1. Why?
  • We can consult height-weight tables for the expected value and variance
  • In the dataset, \(\E(\beta) = 0.55\) with a standard error of 0.006, but since we are uncertain how applicable that is to !Kung, we will allow the prior to vary more \[ \begin{eqnarray} y_i & \sim & \text{Normal}(\mu_i, \, \sigma)\\ \mu_i & = & \alpha + \beta x^c_i \\ \alpha & \sim & \text{Normal}(60, \, 10) \\ \beta & \sim & \text{Normal}(0.55, \, 0.1) \\ \sigma & \sim & \text{Exp}(r) \\ \end{eqnarray} \]
  • What about the error term \(\sigma\)?

Howell Regression

  • We know that \(\sigma\) must be positive, so a possible choice for the prior is \(\text{Normal}^+\), Exponential, etc.
  • At this stage, the key is to rule out implausible values, not to get something precise, particularly since we have enough data (> 340 observations)
  • From the background data, the residual standard error was 4.6, which implies an exponential rate parameter of about 1/5 (a quick check of this prior follows below) \[ \begin{eqnarray} y_i & \sim & \text{Normal}(\mu_i, \, \sigma)\\ \mu_i & = & \alpha + \beta x^c_i \\ \alpha & \sim & \text{Normal}(60, \, 10) \\ \beta & \sim & \text{Normal}(0.55, \, 0.1) \\ \sigma & \sim & \text{Exp}(0.2) \\ \end{eqnarray} \]
  • We are now ready to perform a prior predictive simulation
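Before moving on, a quick numeric check that \(\text{Exp}(0.2)\) keeps \(\sigma\) in a plausible range without being too precise:
# prior mean of sigma is 1/0.2 = 5 kg; central 90% prior interval:
qexp(c(0.05, 0.95), rate = 0.2) |> round(1)
# roughly 0.3 to 15 kg, which comfortably covers the background value of 4.6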

Prior Predictive Simulation

  • The simulation follows the generative process defined by the model
d <- d |>
  mutate(height_c = height - mean(height))
round(mean(d$height_c), 2)
[1] 0
head(d)
# A tibble: 6 × 5
  height weight   age  male height_c
   <dbl>  <dbl> <dbl> <dbl>    <dbl>
1   152.   47.8    63     1    -2.83
2   140.   36.5    63     0   -14.9 
3   137.   31.9    65     0   -18.1 
4   157.   53.0    41     1     2.25
5   145.   41.3    51     0    -9.18
6   164.   63.0    35     1     9.23
prior_pred <- function(data) {
  alpha <- rnorm(1, 60, 10)
  beta <- rnorm(1, 0.55, 0.1)
  sigma <- rexp(1, 0.2)
  l <- nrow(data); y <- numeric(l)
  for (i in 1:l) {
    mu <- alpha + beta * data$height_c[i]
    y[i] <- rnorm(1, mu, sigma)
  }
  return(y)
}
n <- 100
pr_p <- replicate(n = n, prior_pred(d))
# using library(purrr) functional primitives (reduce(cbind) collapses the list into a matrix):
# pr_p <- map(1:n, \(i) prior_pred(d)) |> reduce(cbind)
dim(pr_p)
[1] 352 100
round(pr_p[1:12, 1:8], 2)
       [,1]  [,2]  [,3]   [,4]  [,5]  [,6]  [,7]  [,8]
 [1,] 75.17 45.38 63.41  79.34 70.60 53.38 59.64 71.54
 [2,] 66.90 43.28 60.23  56.40 64.75 45.41 52.28 61.20
 [3,] 58.87 32.81 34.94  71.38 63.30 73.13 66.56 61.37
 [4,] 64.57 47.99 73.31  40.31 72.81 52.06 29.27 79.36
 [5,] 69.54 41.90 61.59  -4.10 66.94 60.39 44.22 72.73
 [6,] 85.82 53.97 68.87 104.92 76.18 60.51 56.68 76.91
 [7,] 62.48 51.62 68.41  25.53 69.20 64.53 70.15 70.65
 [8,] 77.69 62.25 72.45  63.32 78.10 67.48 63.22 81.76
 [9,] 80.35 43.70 54.51  13.69 68.65 46.92 48.20 70.36
[10,] 85.94 51.44 66.29 111.06 76.71 64.20 53.96 78.94
[11,] 85.61 41.28 53.84  60.63 71.72 83.42 47.55 76.78
[12,] 74.94 55.23 61.77  34.21 70.00 53.53 59.05 68.94

Prior Predictive Simulation

data_mean <- mean(d$weight)
mean_dist <- colMeans(pr_p)
ggplot(data.frame(data_mean, mean_dist), aes(mean_dist)) +
  geom_histogram() +
  geom_vline(xintercept = data_mean, color = 'red') +
  xlab("Distribution of Weight means (kg) under the prior") + ylab("") +
  theme(axis.title.y=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank())

Prior Predictive Simulation

  • To get a sense for the possible regression lines implied by the prior, we can fit a linear model to each simulation draw, and plot the lines over observations
intercepts <- numeric(n)
slopes <- numeric(n)
for (i in 1:n) {
  coefs <- coef(lm(pr_p[, i] ~ d$height_c))
  intercepts[i] <- coefs[1]
  slopes[i] <- coefs[2]
}

# using library(purrr) functional primitives (iterating over column indices):
# df <- map_dfr(1:n, \(i) coef(lm(pr_p[, i] ~ d$height_c)))

p <- ggplot(aes(height_c, weight), data = d)
p + geom_point(size = 0.5) + ylim(20, 90) + 
  geom_abline(slope = slopes, 
              intercept = intercepts, 
              alpha = 1/6) +
  ylab("Weight (kg)") + 
  xlab("Centered Height (cm)") +
  ggtitle("Kalahari !Kung San people", 
          subtitle = "Prior predictive simulation")

Deriving a Posterior Distribution

  • We have seen how to derive the posterior and posterior predictive distribution
  • Three-dimensional posterior: \(f(\alpha, \beta, \sigma)\). What happened to \(\mu\)?
  • We construct the posterior from the prior and data likelihood (for each \(y_i\)): \[ \begin{eqnarray} &\text{Prior: }f(\alpha, \beta, \sigma) = f_1(\alpha) f_2(\beta) f_3(\sigma) \\ &\text{Likelihood: }f(y \mid \alpha, \beta, \sigma) = \prod_{i=1}^{n}f_4(y_i \mid \alpha, \beta, \sigma) \\ &\text{Posterior: }f(\alpha,\beta,\sigma \mid y) = \frac{f_1(\alpha) f_2(\beta) f_3(\sigma) \cdot \left[\prod_{i=1}^{n}f_4(y_i \mid \alpha, \beta, \sigma) \right]} {\int\int\int f_1(\alpha) f_2(\beta) f_3(\sigma) \cdot \left[\prod_{i=1}^{n}f_4(y_i \mid \alpha, \beta, \sigma) \right] d\alpha \, d\beta \, d\sigma} \end{eqnarray} \]
  • To be more precise, we would indicate that \(f_1, f_2\), and \(f_4\) are Normal with different parameters, and \(f_3\) is \(\text{Exp}(0.2)\); a sketch of the resulting unnormalized log posterior follows
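As a computational sketch of the formula above, the numerator (the unnormalized log posterior) can be written directly in R; samplers such as Stan only need this quantity on the log scale, so the triple integral in the denominator never has to be evaluated. The function name and the candidate values are illustrative.
# unnormalized log posterior = log prior + log likelihood
log_post <- function(alpha, beta, sigma, x, y) {
  if (sigma <= 0) return(-Inf)                 # sigma has positive support
  log_prior <- dnorm(alpha, 60, 10, log = TRUE) +
               dnorm(beta, 0.55, 0.1, log = TRUE) +
               dexp(sigma, rate = 0.2, log = TRUE)
  log_lik <- sum(dnorm(y, alpha + beta * x, sigma, log = TRUE))
  log_prior + log_lik
}
log_post(45, 0.6, 4, d$height_c, d$weight)     # evaluate at one candidate point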

Fitting the Model

  • Even though our prior is slightly off, 300+ observations is a lot in this case (big data!), and so we proceed to model fitting
  • We will use the stan_glm() function from rstanarm
  • rstanarm has default priors, but you should specify your own:
library(rstanarm)
library(bayesplot)
options(mc.cores = parallel::detectCores())

m1 <- stan_glm(
  weight ~ height_c,
  data = d,
  family = gaussian,
  prior_intercept = normal(60, 10),
  prior = normal(0.55, 0.1),
  prior_aux = exponential(0.2),
  chains = 4,
  iter = 500,
  seed = 1234
)
  • By default, rstanarm samples from the posterior. To get back the prior predictive distribution (instead of doing it in R), use prior_PD = TRUE
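A sketch of that prior-predictive variant: the same call as m1 with prior_PD = TRUE, so the likelihood is ignored and posterior_predict() returns prior predictive draws (compare with pr_p computed by hand earlier).
m1_prior <- stan_glm(
  weight ~ height_c,
  data = d,
  family = gaussian,
  prior_intercept = normal(60, 10),
  prior = normal(0.55, 0.1),
  prior_aux = exponential(0.2),
  prior_PD = TRUE,   # ignore the likelihood; sample from the prior only
  chains = 4,
  iter = 500,
  seed = 1234
)
yrep_prior <- posterior_predict(m1_prior)   # prior predictive draws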

Looking at the Model Summary

summary(m1)

Model Info:

 function:     stan_glm
 family:       gaussian [identity]
 formula:      weight ~ height_c
 algorithm:    sampling
 sample:       1000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 352
 predictors:   2

Estimates:
              mean   sd   10%   50%   90%
(Intercept) 45.0    0.2 44.7  45.0  45.3 
height_c     0.6    0.0  0.6   0.6   0.7 
sigma        4.2    0.2  4.0   4.2   4.5 

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 45.0    0.3 44.6  45.0  45.4 

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.0  1.0  824  
height_c      0.0  1.0  994  
sigma         0.0  1.0  867  
mean_PPD      0.0  1.0  970  
log-posterior 0.1  1.0  459  

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
  • You can examine the priors by running prior_summary(m1)

Evaluating Quality of the Inferences

neff_ratio(m1) |> round(2)
(Intercept)    height_c       sigma 
       0.82        0.99        0.87 
rhat(m1) |> round(2)
(Intercept)    height_c       sigma 
          1           1           1 
mcmc_trace(m1, size = 0.3)

Same Model in Stan

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
  int<lower=0, upper=1> prior_PD;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] mu = alpha + beta * x;
}
model {
  alpha ~ normal(60, 10);
  beta ~ normal(0.55, 0.1);
  sigma ~ exponential(0.2);
  if (!prior_PD) {
    y ~ normal(mu, sigma);
  }
}
generated quantities {
  array[N] real yrep = normal_rng(mu, sigma);
}
  • You can pass prior_PD as a flag to enable drawing from the prior predictive distribution
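One way to compile and fit this program from R, assuming it is saved as howell.stan and using the cmdstanr interface (both the file name and the choice of interface are assumptions, not specified in the slides):
library(cmdstanr)

stan_data <- list(
  N = nrow(d),
  x = d$height_c,
  y = d$weight,
  prior_PD = 0                        # set to 1 to draw from the prior predictive instead
)

mod <- cmdstan_model("howell.stan")   # assumed file name
fit <- mod$sample(
  data = stan_data,
  chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
  seed = 1234
)
fit$summary(variables = c("alpha", "beta", "sigma"))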

Prior vs Posterior

  • Comparing the prior to the posterior tells us how much the model learned from data
  • It also helps us to validate if our priors were reasonable
  • In rstanarm, you can use the posterior_vs_prior function
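A minimal call (the choice of pars here is illustrative):
# overlay draws from the prior and the posterior for the regression coefficients
posterior_vs_prior(m1, pars = c("(Intercept)", "height_c"))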

  • Your prior should cover the plausible range of parameter values
  • When we don’t have a lot of data and parameters are complex, setting good priors takes work, but there are guidelines

Examining the Posterior

library(tidybayes)

draws <- spread_draws(m1, `(Intercept)`, height_c, sigma)
knitr::kable(head(round(draws, 2)))
.chain .iteration .draw (Intercept) height_c sigma
     1          1     1       44.97     0.62  4.09
     1          2     2       45.06     0.61  4.21
     1          3     3       45.02     0.63  4.13
     1          4     4       45.07     0.64  4.32
     1          5     5       44.66     0.64  4.38
     1          6     6       45.31     0.62  4.12
  • spread_draws will arrange the inferences in columns (wide format)
  • gather_draws will arrange the inferences in rows (long format), which is usually more convenient for plotting and computation

Examining the Posterior

options(digits = 3)
draws <- gather_draws(m1, `(Intercept)`, height_c, sigma)
knitr::kable(tail(draws, 4))
.chain .iteration .draw .variable .value
     4        247   997     sigma   4.09
     4        248   998     sigma   4.22
     4        249   999     sigma   4.20
     4        250  1000     sigma   4.11
draws |> mean_qi(.width = 0.90) |> knitr::kable() # also see ?median_qi(), etc
.variable   .value .lower .upper .width .point .interval
(Intercept) 45.000 44.635  45.36    0.9   mean        qi
height_c     0.624  0.576   0.67    0.9   mean        qi
sigma        4.250  3.996   4.53    0.9   mean        qi
  • From the above table: \(\E(y \mid x^c) (\text{kg}) = 45 (\text{kg}) + 0.62 (\text{kg/cm}) \cdot x^c (\text{cm})\)

Variability in Parameter Inferences

  • The code in Section 9.4 of the book no longer works, as the function and variable names have changed
dpred <- d |> 
  # same as add_epred_draws() for linear regression (identity link), but not for other GLMs
  add_linpred_draws(m1, ndraws = 50)
head(dpred, 3)
# A tibble: 3 × 10
# Groups:   height, weight, age, male, height_c, .row [1]
  height weight   age  male height_c  .row .chain .iteration .draw .linpred
   <dbl>  <dbl> <dbl> <dbl>    <dbl> <int>  <int>      <int> <int>    <dbl>
1   152.   47.8    63     1    -2.83     1     NA         NA     1     43.3
2   152.   47.8    63     1    -2.83     1     NA         NA     2     42.9
3   152.   47.8    63     1    -2.83     1     NA         NA     3     43.0
p <- dpred |> ggplot(aes(x = height_c, y = weight)) +
  geom_line(aes(y = .linpred, group = .draw), 
            alpha = 0.1) + 
  geom_point(data = d, size = 0.05) +
  ylab("Weight (kg)") + 
  xlab("Centered Height (cm)") +
  ggtitle("100 draws from the slope/intercept posterior")
print(p)

Posterior Predictions

  • Suppose we are interested in predicting the weight of a person with a height of 160 cm
  • This corresponds to a centered height of 5.4 (\(160 - \bar{x}\))
  • We can now compute the distribution of the mean weight of a 160 cm person (reflecting variability in the slope and intercept only):
    • \(\mu = \alpha + \beta \cdot 5.4\), for each posterior draw
  • And a predictive distribution:
    • \(y_{\text{pred}} \sim \text{Normal}(\mu, \sigma)\)
draws <- spread_draws(m1, `(Intercept)`, height_c, sigma)
draws <- draws |>
  mutate(mu = `(Intercept)` + height_c * 5.4,
         y_pred = rnorm(nrow(draws), mu, sigma))
draws[1:3, 4:8]
# A tibble: 3 × 5
  `(Intercept)` height_c sigma    mu y_pred
          <dbl>    <dbl> <dbl> <dbl>  <dbl>
1          45.0    0.619  4.09  48.3   43.5
2          45.1    0.607  4.21  48.3   50.2
3          45.0    0.628  4.13  48.4   46.6

Posterior Predictions

  • We can compare the predictive and average (mean) densities
  • The left panel shows the densities on their own
  • The right panel shows the same densities in the context of the raw observations
mqi <- draws |> median_qi(.width = 0.90)
select(mqi, contains(c('mu', 'y_pred'))) |> round(2)
    mu mu.lower mu.upper y_pred y_pred.lower y_pred.upper
1 48.4     47.9     48.8   48.2         41.2         55.2

RStanArm Prediction Functions

  • posterior_linpred returns a \(D \times N\) matrix with D draws and N data points
    • \(\eta_n = \alpha + \sum_{p=1}^P \beta_p x_{np}\), where \(P\) is the total number of regression inputs
  • posterior_epred returns a \(D \times N\) matrix that applies the inverse link function (in GLMs) to the linear predictor \(\eta\)
    • \(\mu_n = \E(y \mid x_n)\); this is the same as \(\eta_n\) in linear regression
  • posterior_predict returns a \(D \times N\) matrix of predictions: draws of \(y_n\) given \(\mu_n\) (and \(\sigma\))

Predictions in RStanArm

  • Posterior linear predictor
eta <- posterior_linpred(m1, newdata = data.frame(height_c = 5.4))
quantile(eta, probs = c(0.05, 0.50, 0.95)) |> round(2)
  5%  50%  95% 
47.9 48.4 48.8 
glue::glue('From the R simulation, 90% interval for eta = [{mqi$mu.lower |> round(2)}, {mqi$mu.upper |> round(2)}]')
From the R simulation, 90% interval for eta = [47.91, 48.82]
  • Posterior conditional mean
mu <- posterior_epred(m1, newdata = data.frame(height_c = 5.4))
quantile(mu, probs = c(0.05, 0.50, 0.95)) |> round(2)
  5%  50%  95% 
47.9 48.4 48.8 
  • Posterior prediction
y_pred <- posterior_predict(m1, newdata = data.frame(height_c = 5.4))
quantile(y_pred, probs = c(0.05, 0.50, 0.95)) |> round(2)
  5%  50%  95% 
41.6 48.5 55.0 
glue::glue('From the R simulation, 90% interval for y_pred = [{mqi$y_pred.lower |> round(2)}, {mqi$y_pred.upper |> round(2)}]')
From the R simulation, 90% interval for y_pred = [41.24, 55.25]

Evaluating Model Quality



Evaluating Quality of the Model

  • There are at least two stages of model evaluation: 1) the quality of the draws and 2) the quality of predictions
  • There is also a question of model fairness
    • How were the data collected?
    • People will likely interpret the results causally, even if not appropriate
    • How will the model be used?
    • Example: Correctional Offender Management Profiling for Alternative Sanctions (COMPAS)
  • Just because the draws have good statistical properties (e.g., good mixing, low auto-correlation, etc.), it does not mean the model will perform well
  • Model performance is assessed on how well it can make predictions, minimize costs and/or maximize benefits. Predictive accuracy is a common way of evaluating model performance.

Evaluating Quality of the Model

  • Once we are satisfied that the draws are statistically well-behaved, we can focus on evaluating predictive performance
  • We typically care about predictive performance out-of-sample
  • In general, a good model is well-calibrated and makes sharp predictions (Gneiting et al. 2007)
  • For in-sample assessment, we perform Posterior Predictive Checks or PPCs
  • To assess out-of-sample performance, we rely on cross-validation or its approximations (a brief kfold() sketch follows this list)
  • If you are making causal (counterfactual) predictions, naive cross-validation will not work. Why?
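A brief sketch of one exact (if expensive) approach, K-fold cross-validation via rstanarm's kfold(); PSIS-LOO, used at the end of this lecture, approximates full leave-one-out without refitting:
# exact 10-fold cross-validation: refit the model 10 times, each time holding out
# one fold, and sum the log predictive densities of the held-out observations
kf <- kfold(m1, K = 10)
kf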

Model Evaluation

Here is the high-level plan of attack:

  • Fit a linear model to the full !Kung dataset (not just adults) and let RStanArm pick the priors (don’t do this at home)
  • We know that this model is not quite right
  • Evaluate the model fit
  • Fix the model by thinking about the relationship between height and weight
  • Evaluate the improved model
  • Compare the linear model to the improved model

Model Evaluation

  • The following fits a linear model to the full dataset (not just adults)
m2 <- stan_glm(
  weight ~ height,
  data = d,
  family = gaussian,
  chains = 4,
  iter = 600,
  seed = 1234
)
summary(m2)
  • There were no sampling problems
  • And posterior draws look well-behaved
  • But how good is this model?

Model Info:

 function:     stan_glm
 family:       gaussian [identity]
 formula:      weight ~ height
 algorithm:    sampling
 sample:       1200 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 544
 predictors:   2

Estimates:
              mean   sd    10%   50%   90%
(Intercept) -33.8    1.1 -35.1 -33.7 -32.4
height        0.5    0.0   0.5   0.5   0.5
sigma         5.0    0.1   4.8   5.0   5.2

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 35.6    0.3 35.2  35.6  36.0 

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.0  1.0  1198 
height        0.0  1.0  1228 
sigma         0.0  1.0   836 
mean_PPD      0.0  1.0  1126 
log-posterior 0.1  1.0   497 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Visual Posterior Predictive Checks

  • The idea behind PPCs is to compare the distribution of the observations to the posterior predictive distribution
  • We already saw an example of how to do it by hand
  • Here, we will do it using functions from rstanarm and bayesplot
library(bayesplot)
# bayesplot::pp_check(m2) <-- shortcut

# predict for every observed point
yrep <- posterior_predict(m2) 

# select 50 draws at random
s <- sample(nrow(yrep), 50)

# plot data against predictive densities
ppc_dens_overlay(d$weight, yrep[s, ]) 

Visual Posterior Predictive Checks

  • We can also compute the distributions of test statistics:
library(gridExtra)
p1 <- ppc_stat(d$weight, yrep, stat = "mean"); p2 <- ppc_stat(d$weight, yrep, stat = "sd")
q25 <- function(y) quantile(y, 0.25); q75 <- function(y) quantile(y, 0.75)
p3 <- ppc_stat(d$weight, yrep, stat = "q25"); p4 <- ppc_stat(d$weight, yrep, stat = "q75"); 
grid.arrange(p1, p2, p3, p4, ncol = 4)

  • ppc_stat computes a chosen test statistic on the observed data and on each posterior (predictive) draw
  • For example, ppc_stat(d$weight, yrep, stat = "mean") is equivalent to:
    • Setting \(T(y) = \text{mean(d\$weight)}\) and \(T_{\text{yrep}} = \text{rowMeans(yrep)}\)
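A hand-rolled version of the mean case, following the equivalence stated above:
# manual equivalent of ppc_stat(d$weight, yrep, stat = "mean")
T_y    <- mean(d$weight)   # statistic computed on the observed data
T_yrep <- rowMeans(yrep)   # the same statistic on each posterior predictive draw
hist(T_yrep, main = "T(yrep) = mean", xlab = "Mean weight (kg)")
abline(v = T_y, col = "red", lwd = 2)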

Visual Posterior Predictive Checks

  • Since we have a predictive distribution at each observation point, we can plot it against the corresponding observed value
  • We will first randomly select a subset of 50 people
  • Each column of yrep contains the predictions for one observation
s <- sample(ncol(yrep), 50)

bayesplot::ppc_intervals(
  d$weight[s],
  yrep = yrep[, s],
  x = d$height[s],
  prob = 0.5,
  prob_outer = 0.90
) + xlab("Height (cm)")  + 
  ylab("Weight (kg)") +
  ggtitle("Predicted vs observed weight")

Quantifying Model Accuracy

  • We looked at some visual evidence for in-sample model accuracy
  • The model is clearly doing a poor job of capturing observed data, particularly in the lower quantiles of the predictive distribution
  • In a situation like this, we would typically proceed to model improvements
  • Often, the model is not as bad as this one, and we would like to get some quantitative measures of model fit
  • We do this so we can assess the relative performance of the next set of models

Quantifying Model Accuracy

  • One way to assess model accuracy is to compute the average squared deviation of the observations from the point predictions, usually reported as the root mean square error:
    • \(\text{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (y_n - \E(y_n|\theta))^2}\)
    • Or its scaled version, which divides the summand by \(\V(y_n \mid \theta)\)
  • These measures do not work well for non-normal models
  • On the plus side, RMSE is intuitive (why?) and easy to compute
  • Note that if we simply average the signed errors, we will get approximately zero
# average error
mean(d$weight - colMeans(yrep)) |> round(2)
[1] -0.01
# root mean square error
mean((d$weight - colMeans(yrep))^2) |> sqrt() |> round(2)
[1] 4.98
# your book reports median absolute error
median(abs(d$weight - colMeans(yrep)))
[1] 3.58

Quantifying Model Accuracy

  • In a probabilistic prediction, we can assess model calibration
  • Calibration says that, for example, 50% intervals contain approximately 50% of the observations, and so on
  • A well-calibrated model may still be a bad model (see below) as the uncertainty may be too wide; for two models with the same calibration, we prefer the one with lower uncertainty
inside <- function(y, obs) return(obs >= y[1] & obs <= y[2])
calib  <- function(y, data, interval) {
  mid <- interval / 2
  l <- 0.5 - mid; u <- 0.5 + mid
  intervals <- apply(y, 2, quantile, probs = c(l, u))
  is_inside <- numeric(ncol(intervals))
  for (i in 1:ncol(intervals)) {
    is_inside[i] <- inside(intervals[, i], data[i])
  }
 return(mean(is_inside))
}
calib(yrep, d$weight, 0.50)
[1] 0.478
calib(yrep, d$weight, 0.90)
[1] 0.912

Quantifying Model Accuracy

  • A popular “proper” scoring rule for probabilistic prediction is the log score
  • This is equivalent to the log predictive density \(\log f(y \mid \theta)\)
  • We would like to estimate the model’s ability to predict data it has not seen, even though we do not observe the true data-generating process
  • This quantity is approximated by the log pointwise predictive density, computed leave-one-out (this is the elpd_loo reported by the loo package) \[ \text{lppd} = \sum_{n=1}^{N} \log \left( \frac{1}{S} \sum_{s=1}^{S} f(y_n \mid \theta_{-n, s}) \right) \]
  • To compute this quantity exactly, fit the model N times, dropping one point at a time (\(-n\))
  • S is the number of posterior draws, and \(\theta_{-n, s}\) is the s-th draw from the posterior obtained without the n-th data point
  • PSIS-LOO (Pareto Smoothed Importance Sampling, Vehtari et al., 2017) provides a computationally efficient approximation to LOO without refitting the model N times

Quantifying Model Accuracy

library(loo)
lm2 <- loo(m2)
lm2

Computed from 1200 by 544 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1648.7 14.3
p_loo         3.0  0.3
looic      3297.3 28.6
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.0]).

All Pareto k estimates are good (k < 0.68).
See help('pareto-k-diagnostic') for details.
plot(lm2)