We borrow this example from Richard McElreath’s Statistical Rethinking
The data sets provided were produced between 1969 and 2008, based on Nancy Howell’s observations of the !Kung San
From Wikipedia: “The ǃKung are one of the San peoples who live mostly on the western edge of the Kalahari desert, Ovamboland (northern Namibia and southern Angola), and Botswana.”
Howell Dataset
Data sample and summary:
 height   weight age male
151.765 47.82561  63    1
139.700 36.48581  63    0
136.525 31.86484  65    0
156.845 53.04191  41    1
145.415 41.27687  51    0
163.830 62.99259  35    1
height weight age
Min. : 53.98 Min. : 4.252 Min. : 0.00
1st Qu.:125.09 1st Qu.:22.008 1st Qu.:12.00
Median :148.59 Median :40.058 Median :27.00
Mean :138.26 Mean :35.611 Mean :29.34
3rd Qu.:157.48 3rd Qu.:47.209 3rd Qu.:43.00
Max. :179.07 Max. :62.993 Max. :88.00
Notice the non-linearity in the relationship between height and weight
Thinking about why this should be can give you insight into how to model these data
Howell Dataset
For now, we will focus on the linear subset of the data
We will demonstrate the non-linear model at the end
We will restrict our attention to adults (age > 18)
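As a sketch (not the code used on these slides; the data source and variable names here are assumptions), one way to load the data and take the adult subset is:

```r
# Sketch only: the Howell data ships with the 'rethinking' package as Howell1;
# its column names match the sample shown above
data(Howell1, package = "rethinking")
d <- Howell1                    # full data: 544 people
d_adult <- subset(d, age > 18)  # adult subset used for the linear model
```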
General Approach
1. Assess the scope of the inferences that you will get with this model
   - Unless you are doing causal inference, you should interpret your coefficients as comparisons (RAOS, Page 84)
2. Set up reasonable priors and likelihood
   - For more complex models, perform a forward simulation with fixed parameter values and try to recover them by running an inference algorithm. See this example for a standard epidemiological model.
3. Perform a prior predictive simulation
   - Possibly adjust your priors
4. Fit the model to data
5. Assess the quality of the inference and the quality of the model
6. Improve or fix your model and go back to #3
Howell Regression
We will build a predictive model for adult Weight \(y\) given Height \(x\) using the Howell dataset
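Written with the hyperparameters left unspecified (reconstructed here from the fully specified versions that follow), the model is \[
\begin{eqnarray}
y_i & \sim & \text{Normal}(\mu_i, \, \sigma)\\
\mu_i & = & \alpha + \beta x_i \\
\alpha & \sim & \text{Normal}(\alpha_{l}, \, \alpha_s) \\
\beta & \sim & \text{Normal}(\beta_{l}, \, \beta_s) \\
\sigma & \sim & \text{Exp}(r) \\
\end{eqnarray}
\]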
We have to specify \(\alpha_{l}\) and \(\alpha_s\), where l and s signify location and scale, the analogous \(\beta_l\) and \(\beta_s\) for the slope, and \(r\), the rate of the exponential prior on \(\sigma\)
If we work on the original scale for \(x\), it is awkward to choose a prior for the intercept: it corresponds to the weight of a person with zero height
This can be fixed by subtracting the average height from \(x\)
Why is \(\text{Exp}(r)\) a reasonable prior for \(\sigma\)?
Howell Regression
We define a new variable, the centered version of \(x\): \(x^c_i = x_i - \bar{x}\)
Now \(\alpha\) corresponds to the expected weight of a person of average height
Checking Wikipedia reveals that the average weight of a person in Africa is about 60 kg
They don’t state the standard deviation, but it is unlikely that an African adult would weigh less than 30 kg or more than 120 kg, so we will set the prior sd = 10 \[
\begin{eqnarray}
y_i & \sim & \text{Normal}(\mu_i, \, \sigma)\\
\mu_i & = & \alpha + \beta x^c_i \\
\alpha & \sim & \text{Normal}(60, \, 10) \\
\beta & \sim & \text{Normal}(\beta_{l}, \, \beta_s) \\
\sigma & \sim & \text{Exp}(r) \\
\end{eqnarray}
\]
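As a quick check (a sketch, not from the original slides), the Normal(60, 10) prior on \(\alpha\) keeps essentially all of its mass in a plausible adult weight range:

```r
# Central 99% interval implied by the Normal(60, 10) prior on alpha (kg)
qnorm(c(0.005, 0.995), mean = 60, sd = 10) |> round(1)
# [1] 34.2 85.8
```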
What about the slope \(\beta\)?
Howell Regression
In this dataset, the units of \(\beta\) are \(\frac{kg}{cm}\), since weight is measured in \(kg\) and height in \(cm\)
From background data, the estimated slope is \(0.55\) with a standard error of 0.006, but since we are uncertain how applicable that is to the !Kung, we will allow the prior to vary more \[
\begin{eqnarray}
y_i & \sim & \text{Normal}(\mu_i, \, \sigma)\\
\mu_i & = & \alpha + \beta x^c_i \\
\alpha & \sim & \text{Normal}(60, \, 10) \\
\beta & \sim & \text{Normal}(0.55, \, 0.1) \\
\sigma & \sim & \text{Exp}(r) \\
\end{eqnarray}
\]
What about the error term \(\sigma\)?
Howell Regression
We know that \(\sigma\) must be positive, so possible choices for the prior include \(\text{Normal}^+\) (half-normal), Exponential, etc.
At this stage, the key is to rule out implausible values, not to get something precise, particularly since we have enough data (> 340 observations)
From the background data, the residual standard error was 4.6, which suggests an exponential prior with mean around 5, i.e., a rate parameter of about 1/5 \[
\begin{eqnarray}
y_i & \sim & \text{Normal}(\mu_i, \, \sigma)\\
\mu_i & = & \alpha + \beta x^c_i \\
\alpha & \sim & \text{Normal}(60, \, 10) \\
\beta & \sim & \text{Normal}(0.55, \, 0.1) \\
\sigma & \sim & \text{Exp}(0.2) \\
\end{eqnarray}
\]
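As a similar sketch for \(\sigma\): an \(\text{Exp}(0.2)\) prior has mean \(1/0.2 = 5\) and covers a generous range of residual scales:

```r
# 5th, 50th, and 95th percentiles of the Exp(0.2) prior on sigma (kg)
qexp(c(0.05, 0.5, 0.95), rate = 0.2) |> round(1)
# [1]  0.3  3.5 15.0
```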
We are now ready to perform a prior predictive simulation
Prior Predictive Simulation
The simulation follows the generative process defined by the model
library(dplyr)
# center height so the intercept corresponds to a person of average height
d <- d |> mutate(height_c = height - mean(height))
round(mean(d$height_c), 2)
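The matrix pr_p used in the next chunk holds the prior predictive draws; its construction is not shown in this excerpt. A minimal sketch that would produce it (rows index observations, columns index simulations, so colMeans(pr_p) gives one mean weight per simulated dataset):

```r
# Prior predictive simulation: draw (alpha, beta, sigma) from the priors,
# then simulate a weight for every observed (centered) height
n_sims <- 1000
alpha  <- rnorm(n_sims, 60, 10)     # intercept prior (kg)
beta   <- rnorm(n_sims, 0.55, 0.1)  # slope prior (kg/cm)
sigma  <- rexp(n_sims, 0.2)         # residual-scale prior
pr_p <- sapply(1:n_sims, function(s)
  rnorm(nrow(d), alpha[s] + beta[s] * d$height_c, sigma[s]))
dim(pr_p)  # nrow(d) x n_sims
```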
library(ggplot2)
# pr_p: prior predictive draws (assumed observations x simulations; see sketch above)
data_mean <- mean(d$weight)
mean_dist <- colMeans(pr_p)
ggplot(data.frame(data_mean, mean_dist), aes(mean_dist)) +
  geom_histogram() +
  geom_vline(xintercept = data_mean, color = 'red') +
  xlab("Distribution of Weight means (kg) under the prior") +
  ylab("") +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
Prior Predictive Simulation
To get a sense of the possible regression lines implied by the prior, we can fit a linear model to each simulation draw and plot the lines over the observations
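A sketch of that plot (assuming d with height_c and the pr_p matrix from the chunks above):

```r
# Fit an OLS line to each of 50 prior predictive datasets (columns of pr_p)
# and overlay the implied regression lines on the observed data
plot(d$height_c, d$weight, pch = 16, col = "grey60",
     xlab = "Centered height (cm)", ylab = "Weight (kg)")
for (s in 1:50) {
  abline(lm(pr_p[, s] ~ d$height_c), col = adjustcolor("steelblue", 0.3))
}
```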
glue::glue('From the R simulation, 90% interval for y_pred = [{mqi$y_pred.lower |> round(2)}, {mqi$y_pred.upper |> round(2)}]')
From the R simulation, 90% interval for y_pred = [41.24, 55.25]
Evaluating Model Quality
Evaluating Quality of the Model
There are at least two stages of model evaluation: 1) the quality of the draws and 2) the quality of predictions
There is also a question of model fairness
How were the data collected?
People will likely interpret the results causally, even if not appropriate
How will the model be used?
Example: Correctional Offender Management Profiling for Alternative Sanctions (COMPAS)
Just because the draws have good statistical properties (e.g., good mixing, low auto-correlation, etc.), it does not mean the model will perform well
Model performance is assessed by how well the model makes predictions, minimizes costs, and/or maximizes benefits. Predictive accuracy is a common way of evaluating model performance.
Evaluating Quality of the Model
Once we are satisfied that the draws are statistically well-behaved, we can focus on evaluating predictive performance
We typically care about predictive performance out-of-sample
In general, a good model is well-calibrated and makes sharp predictions (Gneiting et al. 2007)1
For in-sample assessment, we perform Posterior Predictive Checks or PPCs
To assess out-of-sample performance, we rely on cross-validation or its approximations
If you are making causal (counterfactual) predictions, naive cross-validation will not work. Why?
Model Evaluation
Here is the high-level plan of attack:
Fit a linear model to the full !Kung dataset (not just adults) and let RStanArm pick the priors (don’t do this at home)
We know that this model is not quite right
Evaluate the model fit
Fix the model by thinking about the relationship between height and weight
Evaluate the improved model
Compare the linear model to the improved model
Model Evaluation
The following fits a linear model to the full dataset (not just adults)
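The code for this fit is not shown in the excerpt; a minimal sketch that matches the summary below (the chains, iterations, and seed are assumptions chosen to give 1200 posterior draws; the priors are rstanarm's defaults):

```r
library(rstanarm)
# Linear model on the full dataset with rstanarm's default (data-dependent) priors
m2 <- stan_glm(weight ~ height, data = d, family = gaussian(),
               chains = 4, iter = 600, seed = 1234)  # 4 x 300 post-warmup = 1200 draws
summary(m2)
```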
Model Info:
function: stan_glm
family: gaussian [identity]
formula: weight ~ height
algorithm: sampling
sample: 1200 (posterior sample size)
priors: see help('prior_summary')
observations: 544
predictors: 2
Estimates:
mean sd 10% 50% 90%
(Intercept) -33.8 1.1 -35.1 -33.7 -32.4
height 0.5 0.0 0.5 0.5 0.5
sigma 5.0 0.1 4.8 5.0 5.2
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 35.6 0.3 35.2 35.6 36.0
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 1198
height 0.0 1.0 1228
sigma 0.0 1.0 836
mean_PPD 0.0 1.0 1126
log-posterior 0.1 1.0 497
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Visual Posterior Predictive Checks
The idea behind PPCs is to compare the distribution of the observations to the posterior predictions
We already saw an example of how to do it by hand
Here, we will do it using functions from RStanArm and bayesplot
library(bayesplot)
# bayesplot::pp_check(m2) <-- shortcut
# predict for every observed point
yrep <- posterior_predict(m2)
# select 50 draws at random
s <- sample(nrow(yrep), 50)
# plot data against predictive densities
ppc_dens_overlay(d$weight, yrep[s, ])
Visual Posterior Predictive Checks
We can also compute the distributions of test statistics:
library(gridExtra)
p1 <- ppc_stat(d$weight, yrep, stat = "mean")
p2 <- ppc_stat(d$weight, yrep, stat = "sd")
q25 <- function(y) quantile(y, 0.25)
q75 <- function(y) quantile(y, 0.75)
p3 <- ppc_stat(d$weight, yrep, stat = "q25")
p4 <- ppc_stat(d$weight, yrep, stat = "q75")
grid.arrange(p1, p2, p3, p4, ncol = 4)
ppc_stat is shorthand for computing a test statistic on the observed data and on each posterior (predictive) draw
For example, ppc_stat(d$weight, yrep, stat = "mean") is equivalent to:
Setting \(T(y) = \text{mean(d\$weight)}\) and \(T_{\text{yrep}} = \text{rowMeans(yrep)}\)
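In other words (a sketch of the manual equivalent, assuming yrep from the previous chunk):

```r
# Compare the observed mean to the distribution of means across posterior
# predictive draws -- this is what ppc_stat(d$weight, yrep, stat = "mean") plots
T_y    <- mean(d$weight)  # statistic computed on the data
T_yrep <- rowMeans(yrep)  # statistic computed on each draw (one value per row)
hist(T_yrep, main = "Means of posterior predictive draws", xlab = "Mean weight (kg)")
abline(v = T_y, col = "red", lwd = 2)
```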
Visual Posterior Predictive Checks
Since we have a distribution at each observation point, we can plot the predictive distribution at each observation
We will first randomly select a subset of 50 people
Each column of yrep contains the predictive draws for one observation point
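A quick way to confirm the orientation of yrep (rows are draws, columns are observations):

```r
dim(yrep)  # 1200 draws x 544 observations for this fit
```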
s <- sample(ncol(yrep), 50)
bayesplot::ppc_intervals(
  d$weight[s],
  yrep = yrep[, s],
  x = d$height[s],
  prob = 0.5,
  prob_outer = 0.90
) +
  xlab("Height (cm)") +
  ylab("Weight (kg)") +
  ggtitle("Predicted vs observed weight")
Quantifying Model Accuracy
We looked at some visual evidence for in-sample model accuracy
The model is clearly doing a poor job of capturing observed data, particularly in the lower quantiles of the predictive distribution
In a situation like this, we would typically proceed to model improvements
Often, the model is not as bad as this one, and we would like to get some quantitative measures of model fit1
We do this so we can assess the relative performance of the next set of models
Quantifying Model Accuracy
One way to assess model accuracy is to compute the average squared deviation from the point predictions: the mean squared error (reported below as its square root, the RMSE)
# root mean square error
mean((d$weight - colMeans(yrep))^2) |> sqrt() |> round(2)
[1] 4.98
# your book reports median absolute error
median(abs(d$weight - colMeans(yrep)))
[1] 3.58
Quantifying Model Accuracy
For probabilistic predictions, we can also assess model calibration
Calibration means that, for example, 50% predictive intervals contain approximately 50% of the observations, and so on (see the coverage sketch after this list)
A well-calibrated model may still be a bad model (see below) as the uncertainty may be too wide; for two models with the same calibration, we prefer the one with lower uncertainty
A popular “proper” scoring rule for probabilistic prediction is the log score
This is equivalent to the log predictive density \(\log f(y \mid \theta)\)
We would like to estimate the model’s ability to predict data it has not seen, even though we do not observe the true data-generating process
This quantity is approximated by log-pointwise-predictive-density \[
\text{lppd} = \sum_{n=1}^{N} \log \left( \frac{1}{S} \sum_{s=1}^{S} f(y_n \mid \theta_{-n, s}) \right)
\]
To compute this quantity, fit the model N times, dropping one point at a time (-n)
S is the number of posterior samples, and \(\theta_{-n, s}\) is the s-th sample from the posterior obtained without the n-th data point
PSIS-LOO (Pareto Smoothed Importance Sampling, Vehtari et al., 2017) provides a computationally efficient way to approximate LOO without refitting the model N times
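Before turning to PSIS-LOO, here is a sketch (assuming yrep and m2 from earlier) of the calibration coverage check mentioned above and of the in-sample log pointwise predictive density; note that, unlike the formula above, this version uses the full posterior rather than leave-one-out posteriors:

```r
# Empirical coverage of the central 50% posterior predictive intervals;
# for a well-calibrated model this should be close to 0.50
lo <- apply(yrep, 2, quantile, probs = 0.25)
hi <- apply(yrep, 2, quantile, probs = 0.75)
mean(d$weight >= lo & d$weight <= hi)

# In-sample lppd: average the pointwise densities over draws, then sum the logs
ll <- log_lik(m2)                   # S x N matrix of log f(y_n | theta_s)
lppd <- sum(log(colMeans(exp(ll))))
lppd
```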
Quantifying Model Accuracy
library(loo)
lm2 <- loo(m2)
lm2
Computed from 1200 by 544 log-likelihood matrix.
Estimate SE
elpd_loo -1648.7 14.3
p_loo 3.0 0.3
looic 3297.3 28.6
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.0]).
All Pareto k estimates are good (k < 0.68).
See help('pareto-k-diagnostic') for details.