Eric Novik – Bayesian Inference

Introductions

What is the city of your birth?

\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\P}{\mathbb{P}} \DeclareMathOperator{\V}{\mathbb{V}} \DeclareMathOperator{\L}{\mathscr{L}} \DeclareMathOperator{\I}{\text{I}} \]

Two Truths and a Lie¹

One person tells three personal statements, one of which is a lie.
Others discuss and guess which statement is the lie, and they jointly construct a numerical statement of their certainty in the guess (on a 0–10 scale).
The storyteller reveals which was the lie.
Enter the certainty number and the outcome (success or failure) and submit in the Google form. Rotate through everyone in your group so that each person plays the storyteller role once.

Two Truths and a Lie

https://tinyurl.com/two-truths-and

Lecture 1: Bayesian Workflow

Two Truths and a Lie
Statistics vs AI/ML
Brief history of Bayesian inference
Review of basic probability
Introduction to Bayesian workflow
Bayes’s rule for events
Binomial model and the Bayesian Crank
Overview of the Course

Statistics vs. AI/ML

⚠️ What follows is an oversimplified opinion.

AI is great for automating tasks that humans find easy
- Recognizing faces, cats, and other objects
- Identifying tumors on a radiology scan
- Playing Chess and Go
- Driving a car
- Post ChatGPT 4: Generating coherent text/images/video, signs of reasoning

Statistics is great at answering questions that humans find hard
- How fast does a drug clear from the body?
- What is the expected tumor size two months after treatment?
- How would patients respond under a different treatment?
- Should I take this drug?
- Should we (FDA/EMA/…) approve this drug?

Brief History

Summary of the book The Theory That Would Not Die

Thomas Bayes (1702(?) — 1761) is credited with the discovery of the “Bayes’s Rule”
His paper was published posthumously by Richard Price in 1763
Laplace (1749 — 1827) independently discovered the rule and published it in 1774
Scientific context: Newton’s Principia was published in 1687
Bayesian wins: German Enigma cipher, search for a missing H-bomb, Federalist papers, Moneyball, FiveThirtyEight

Laplace’s Demon

We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at any given moment knew all of the forces that animate nature and the mutual positions of the beings that compose it, if this intellect were vast enough to submit the data to analysis, could condense into a single formula the movement of the greatest bodies of the universe and that of the lightest atom; for such an intellect nothing could be uncertain, and the future just like the past would be present before its eyes.

Marquis Pierre Simon de Laplace (1729 — 1827)

“Uncertainty is a function of our ignorance, not a property of the world.”

Modern Examples of Bayesian Analyses

Elasticity of Demand (Stan with Hierarchical Model)

Pharmacometrics (Stan with ODEs)

Nonparametric Bayes (Stan with GPs)

Review of Probability

A set of all possible outcomes is called a sample space and denoted by \(\Omega\)
An outcome of an experiment is denoted by \(\omega \in \Omega\)
We typically denote events by capital letters, say \(A \subseteq \Omega\)
Axioms of probability:
- \(\P(A) \geq 0, \, \text{for all } A\)
- \(\P(\Omega) = 1\)
- If \(A_1, A_2, A_3, \ldots\) are disjoint: \(\P(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} \P(A_i)\)

You roll a fair six-sided die twice
Give an example of an \(\omega \in \Omega\)
How many elements are in \(\Omega\)? What is \(\Omega\)?
Define an event \(A\) as the sum of the two rolls less than 11
How many elements are in \(A\)?
What is \(\P(A)\)?

outer(1:6, 1:6, FUN = "+")

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    2    3    4    5    6    7
[2,]    3    4    5    6    7    8
[3,]    4    5    6    7    8    9
[4,]    5    6    7    8    9   10
[5,]    6    7    8    9   10   11
[6,]    7    8    9   10   11   12

\(\Omega\) consists of all pairs:
- \(\{(1, 1), (1, 2), ... (2, 1), (2, 2), ...\}\)

There are 36 such pairs
33 of those pairs result in a sum of less than 11
\(\P(A) = \frac{33}{36} = \frac{11}{12}\)

Random Variables Review

Review
Mapping

Random variable is not random – it is a deterministic mapping from the sample space onto the real line; randomness comes from the experiment
PMF, PDF, CDF (Blitzstein and Hwang, Ch. 3, 5)
Expectations (Blitzstein and Hwang, Ch. 4)
Joint Distributions (Blitzstein and Hwang, Ch. 7)
Conditional Expectations (Blitzstein and Hwang, Ch. 9)

Random variable X for the number of Heads in two flips

Example Simulation

We will use R’s sample() function to simulate rolls of a die and replicate() function to repeat the rolling process many times
Modern approach: purrr::map(1:n, \(x) expression)

die <- 1:6
roll <- function(x, n) {
  sample(x, size = n, replace = TRUE)
}
roll(die, 2) # roll the die twice

[1] 4 3

rolls <- replicate(1e4, roll(die, 2))
rolls[, 1:8] # print first 8 columns

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    6    4    1    6    6    2    5    1
[2,]    6    5    5    3    2    1    1    3

roll_sums <- colSums(rolls)
head(roll_sums)

[1] 12  9  6  9  8  3

mean(roll_sums < 11)

[1] 0.9159

Given a Random Variable \(Y\), \(y^{(1)}, y^{(2)}, y^{(3)}, \ldots, y^N\) are simulations or draws from \(Y\)
Fundamental bridge (Blitzstein & Hwang p. 164): \(\P(A) = \E(\I_A)\)
Computationally: \(\P(Y < 11) \approx \frac{1}{N} \sum^{N}_{n=1} \I(y^n < 11)\)
In R, roll_sums < 11 creates an indicator variable
And mean() does the average

Example Simulation with purrr

Roll a die twice, twenty times

sim <- purrr::map(1:20, \(x) roll(die, 2))
str(sim)

List of 20
 $ : int [1:2] 6 2
 $ : int [1:2] 4 4
 $ : int [1:2] 5 2
 $ : int [1:2] 6 2
 $ : int [1:2] 5 2
 $ : int [1:2] 5 1
 $ : int [1:2] 3 6
 $ : int [1:2] 3 4
 $ : int [1:2] 4 4
 $ : int [1:2] 3 2
 $ : int [1:2] 2 2
 $ : int [1:2] 5 6
 $ : int [1:2] 5 2
 $ : int [1:2] 4 2
 $ : int [1:2] 6 1
 $ : int [1:2] 4 1
 $ : int [1:2] 5 4
 $ : int [1:2] 1 2
 $ : int [1:2] 4 2
 $ : int [1:2] 3 2

Roll a die once, twice, …., twenty times

sim <- purrr::map(1:20, \(x) roll(die, x))
str(sim)

List of 20
 $ : int 6
 $ : int [1:2] 4 1
 $ : int [1:3] 1 2 3
 $ : int [1:4] 1 4 5 2
 $ : int [1:5] 6 2 3 4 4
 $ : int [1:6] 5 3 4 5 4 1
 $ : int [1:7] 1 5 4 5 5 4 3
 $ : int [1:8] 1 6 5 6 4 4 3 4
 $ : int [1:9] 3 3 6 6 5 1 4 1 6
 $ : int [1:10] 1 6 5 5 3 4 5 1 3 5
 $ : int [1:11] 4 4 2 4 6 2 1 2 2 3 ...
 $ : int [1:12] 2 3 2 6 4 2 6 3 3 4 ...
 $ : int [1:13] 2 3 2 2 2 5 1 3 6 2 ...
 $ : int [1:14] 3 2 6 4 2 1 3 2 6 6 ...
 $ : int [1:15] 3 4 3 5 1 6 1 2 5 4 ...
 $ : int [1:16] 6 2 2 6 2 6 3 5 2 6 ...
 $ : int [1:17] 3 5 3 6 5 4 2 1 5 2 ...
 $ : int [1:18] 5 1 3 6 6 6 6 3 4 6 ...
 $ : int [1:19] 5 5 6 1 6 5 1 2 4 3 ...
 $ : int [1:20] 4 6 4 3 4 3 4 3 5 1 ...

Review of Conditional Probability

For arbitrary \(A\) and \(B\), if \(\P(B) > 0\):
- Conditional probability: \(\P(A \mid B) = \frac{\P(AB)}{\P(B)}\)
- Conditional probability: \(\P(B \mid A) = \frac{\P(AB)}{\P(A)}\)
Bayes’s rule: \(\P(A \mid B) = \frac{\P(AB)}{\P(B)} = \frac{\P(B \mid A) \P(A)}{\P(B)}\)
\(A\) and \(B\) are independent if observing \(B\) does not give you any more information about \(A\): \(\P(A \mid B) = \P(A)\)
And \(\P(A B) = \P(A) \P(B)\) (from Bayes’s rule)

Joint, Marginal, and Conditional

Titanic carried approximately 2,200 passengers and sank on 15 April 1912
Let \(G: \{m, f\}\) represent Gender and \(S: \{n, y\}\) represent Survival

surv <- apply(Titanic, c(2, 4), sum) 
surv_prop <- round(surv / sum(surv), 3)
bind_cols(Sex = c("Male", "Female"), surv_prop) |> 
  janitor::adorn_totals(c("row", "col")) |>
  knitr::kable(caption = 
         "Titanic survival proportions")

Titanic survival proportions
Sex	No	Yes	Total
Male	0.620	0.167	0.787
Female	0.057	0.156	0.213
Total	0.677	0.323	1.000

\(\P(G = m \cap S = y) =\) 0.167
\(\P(S = y) = \sum_{i \in \{m, f\}} \P(S = y \, \cap G = i) =\) ??
\(\P(S = y) = \sum_{i \in \{m, f\}} \P(S = y \, \cap G = i) = 0.323\)

Joint, Marginal, and Conditional

What is \(\P(G = m \mid S = y)\), probability of being
male given survival?
To compute that, we only consider the column
where Survival = Yes
\(\P(G = m \mid S = y) = \frac{\P(G = m \, \cap \, S = y)}{\P(S = y)} = \frac{0.167}{0.323} \\ \approx 0.52\)
You want \(\P(S = y \mid G = m)\), comparing it to \(\\ \qquad \ \ \ \ \ \ \P(S = y \mid G = f)\)
\(\P(S = y \mid G = m) = \frac{\P(G = m \, \cap \, S = y)}{\P(G = m)} = \frac{0.167}{0.787} \\ \approx 0.21\)
\(\P(S = y \mid G = f) = \frac{\P(G = f \, \cap \, S = y)}{\P(G = f)} = \frac{0.156}{0.213} \\ \approx 0.73\)
How would you compute \(\P(S = n \mid G = m)\)?
\(\P(S = n \mid G = m) = 1 - \P(S = y \mid G = m)\)

Titanic survival proportions
Sex	No	Yes	Total
Male	0.620	0.167	0.787
Female	0.057	0.156	0.213
Total	0.677	0.323	1.000

“Untergang der Titanic”, as conceived by Willy Stöwer, 1912

Law of Total Probability (LOTP)

Let \(A\) be a partition of \(\Omega\), so that each \(A_i\) is disjoint, \(\P(A_i >0)\), and \(\cup A_i = \Omega\). \[ \P(B) = \sum_{i=1}^{n} \P(B \cap A_i) = \sum_{i=1}^{n} \P(B \mid A_i) \P(A_i) \]

Bayes’s Rule for Events

We can combine the definition of conditional probability with the LOTP to come up with Bayes’s rule for events, assuming \(\P(B) \neq 0\)

\[ \P(A \mid B) = \frac{\P(B \cap A)}{\P(B)} = \frac{\P(B \mid A) \P(A)}{\sum_{i=1}^{n} \P(B \mid A_i) \P(A_i)} \]

We typically think of \(A\) is some unknown we wish to learn (e.g., the status of a disease) and \(B\) as the data we observe (e.g., the result of a diagnostic test)
We call \(\P(A)\) prior probability of A (e.g., how prevalent is the disease in the population)
We call \(\P(A \mid B)\), the posterior probability of the unknown \(A\) given data \(B\)

Estimating \(\pi\) by Simulation

The idea is that we can approximate the ratio of the area of an inscribed circle, \(A_c\), to the area of the square, \(A_s\), by uniformly “throwing darts” at the square with the side \(2r\) and counting how many darts land inside the circle versus inside the square.

\[ \begin{align} A_{c}& = \pi r^2 \\ A_{s}& = (2r)^2 = 4r^2 \\ \frac{A_{c}}{A_{s}}& = \frac{\pi r^2}{4r^2} = \frac{\pi}{4} \implies \pi = \frac{4A_{c}}{A_{s}} \end{align} \]

Estimating \(\pi\) by Simulation

To estimate \(\pi\), we perform the following simulation:

\[ \begin{align} X& \sim \text{Uniform}(-1, 1) \\ Y& \sim \text{Uniform}(-1, 1) \\ \pi& \approx \frac{4 \sum_{i=1}^{N} \I(x_i^2 + y_i^2 < 1)}{N} \end{align} \]

The numerator is a sum over an indicator function \(\I\), which evaluates to \(1\) if the inequality holds and \(0\) otherwise.

Estimating \(\pi\) by Simulation

Code and Plot
Estimate

n <- 1e3
x <- runif(n, -1, 1); y <- runif(n, -1, 1)
inside <- x^2 + y^2 < 1
data <- tibble(x, y, inside)
p <- ggplot(aes(x = x, y = y), data = data)
p + geom_point(aes(color = inside)) + theme(legend.position = "none")

cat("Estimated value of pi =", 4*sum(inside) / n)

Estimated value of pi = 3.16

x	y	inside
-0.8929639	-0.5403777	FALSE
0.1237117	0.0664457	TRUE
-0.3972574	0.6440530	TRUE
-0.5569671	-0.9873628	FALSE
-0.7640633	-0.0044610	TRUE

Example Bayesian Analysis

Suppose we enrolled five people in an early cancer clinical trial
Each person was given an active treatment
From previous trials, we have some idea of historical response rates (proportion of people responding)
At the end of the trial, \(Y = y \in \{0,1,...,5 \}\) people will have responded¹ to the treatment
We are interested in estimating the probability that the response rate is greater or equal to 50%

Image from Fokko Smits, Martijn Dirksen, and Ivo Schoots: RECIST 1.1 - and more

Notation

Greek letters will be used for latent parameters, and English letters will be used for observables.

\(\theta\): unknowns or parameters to be estimated; could be multivariate, discrete, and continuous (your book uses \(\pi\))
\(y\): observations or measurements to be modelled (\(y_1, y_2, ...\))
\(\widetilde{y}\) : unobserved but observable quantities (in your book \(y'\))
\(x\): covariates
\(f( \theta )\): a prior model, P[DM]F of \(\theta\)
\(f_y(y \mid \theta, x)\): an observational model, P[DM]F when it is a function of \(y\) (in your book: \(f(y \mid \pi)\)); we typically drop the \(x\) to simplify the notation.
\(f_{\theta}(y \mid \theta)\): is a likelihood function when it is a function of \(\theta\) (in your book: \(\L(\pi \mid y)\))
Some people write \(\L(\theta; y)\) or simply \(\L(\theta)\)

General Approach

Before observing the data, we need to specify a prior model \(f(\theta)\) on all unknowns \(\theta\)¹
Pick a data model \(f(y \mid \theta, x)\) — this is typically more important than the prior; this includes the model for the conditional mean: \(\E(y \mid x)\)
For more complex models, we construct a prior predictive distribution, \(f(y)\)
- We will define this quantity later — it will help us assess if our choice of priors makes sense on the observational scale
After we observe data \(y\), we treat \(f(y \mid \theta, x)\) as the likelihood of observing \(y\) under all plausible values of \(\theta\), conditioning on \(x\) if necessary
Derive a posterior model for \(\theta\), \(\, f(\theta \mid y, x)\) using Bayes’s rule or by simulation
Evaluate model quality: 1) quality of the inferences; 2) quality of predictions. Revise the model if necessary
Compute all the quantities of interest from the posterior, such as event probabilities, e.g., \(\P(\theta > 0.5)\), posterior predictive distribution \(f(\widetilde{y} \mid y)\), decision functions, etc.

Example Prior Model

We will construct a prior model for our clinical trial
From previous trials, we construct a discretized version of the prior distribution of the response rate
The most likely value for response rate is 30%

dot_plot <- function(x, y) {
  p <- ggplot(data.frame(x, y), aes(x, y))
  p + geom_point(aes(x = x, y = y), size = 0.5) +
    geom_segment(aes(x = x, y = 0, xend = x, 
                     yend = y), linewidth = 0.2) +
    xlab(expression(theta)) + 
    ylab(expression(f(theta)))
}
theta <- c(0.10, 0.30, 0.50, 0.70, 0.90)
prior <- c(0.05, 0.45, 0.30, 0.15, 0.05)
dot_plot(theta, prior) +
  ggtitle("Prior probability of response")

Even though \(\theta\) is a continuous parameter, we can still specify a discrete prior
The posterior will also be discrete and of the same cardinality

Data Model

We consider each person’s response rate to be independent, given the treatment
We have a fixed number of people in the trial, \(N = 5\), and \(0\) to \(5\) successes
We will therefore consider: \(y | \theta \sim \text{Bin}(N,\theta) = \text{Bin}(5,\theta)\)
\(f(y \mid \theta) = \text{Bin} (y \mid 5,\theta) = \binom{5}{y} \theta^y (1 - \theta)^{5 - y}\) for \(y \in \{0,1,\ldots,5\}\)
Is this a valid probability distribution as a function of \(y\)?

N = 5; y <- 0:N; theta <- 0.5
(f_y <- dbinom(x = y, size = N, prob = theta) |>
    fractions())

[1] 1/32 5/32 5/16 5/16 5/32 1/32

sum(dbinom(x = y, size = N, prob = theta))

[1] 1

Likelihood Function

We ran the trial and observed 3 out of 5 responders
We can now construct a likelihood function for \(y = 3\) as a function of \(\theta\)
Let’s check if this function is a probability distribution: \[ f (y) = \int_{0}^{1} \binom{N}{y} \theta^y (1 - \theta)^{N - y}\, d\theta = \frac{1}{N + 1} \]

dbinom_theta <- function(theta, N, y) {
  choose(N, y) * theta^y * (1 - theta)^(N - y) 
}
integrate(dbinom_theta, lower = 0, upper = 1, 
          N = 5, y = 3)[[1]] |> fractions()

[1] 1/6

\(f(y)\) is called marginal distribution of the data or, more aptly, prior predictive distribution
It tells us that prior to observing \(y\), all values of \(y\) are equally likely

Likelihood Function

Compare with the data model:

Compute the Posterior

Compute the likelihood
- \(\binom{5}{3} \theta^3 (1 - \theta)^{2}\) for \(\theta \in \{0.10, 0.30, 0.50, 0.70, 0.90\}\)
Multiply the likelihood by the prior to compute the numerator
- \(f(y = 3 \mid \theta) f(\theta)\)
Sum the numerator to get the marginal likelihood
- \(f(y = 3) = \\ \sum_{\theta} f(y = 3 | \theta) f(\theta) \approx 0.2\)
Finally, compute the posterior
- \(f(\theta \mid y = 3) = \frac{f(y = 3 \mid \theta) f(\theta)}{\sum_{\theta} f(y = 3 | \theta) f(\theta)}\)

N <- 5; y <- 3
theta <- c(0.10, 0.30, 0.50, 0.70, 0.90)
prior <- c(0.05, 0.45, 0.30, 0.15, 0.05)
lik <- dbinom(y, N, theta)
lik_x_prior <-  lik * prior
constant <- sum(lik_x_prior)
post <- lik_x_prior / constant

theta	prior	lik	lik_x_prior	post
0.1	0.05	0.01	0.00	0.00
0.3	0.45	0.13	0.06	0.29
0.5	0.30	0.31	0.09	0.46
0.7	0.15	0.31	0.05	0.23
0.9	0.05	0.07	0.00	0.02
Total	1.00	0.83	0.20	1.00

Computing Event Probability

To compute event probabilities, we integrate (or sum) the relevant regions of the parameter space \[ \P(\theta \geq 0.5) = \int_{0.5}^{1} f(\theta \mid y) \, d\theta \]
In this case, we only have discrete quantities, so we sum:

probs <- d |>
  filter(theta >= 0.50) |>
  dplyr::select(prior, post) |>
  colSums() |>
  round(2)
probs

prior  post 
 0.50  0.71

theta	prior	lik	lik_x_prior	post
0.1	0.05	0.01	0.00	0.00
0.3	0.45	0.13	0.06	0.29
0.5	0.30	0.31	0.09	0.46
0.7	0.15	0.31	0.05	0.23
0.9	0.05	0.07	0.00	0.02
Total	1.00	0.83	0.20	1.00

\(\P(\theta \geq 0.50) =\) 0.5 and \(\P(\theta | y \geq 0.50) =\) 0.71

What If We Used a Flat Prior?

Flat or uniform prior means that we consider all values of \(\theta\) equality likely

N <- 5; y <- 3
theta <- c(0.10, 0.30, 0.50, 0.70, 0.90)
prior <- c(0.20, 0.20, 0.20, 0.20, 0.20)
lik <- dbinom(y, N, theta)
lik_x_prior <-  lik * prior
constant <- sum(lik_x_prior)
post <- lik_x_prior / constant

theta	prior	lik	lik_x_prior	post
0.1	0.2	0.01	0.00	0.01
0.3	0.2	0.13	0.03	0.16
0.5	0.2	0.31	0.06	0.37
0.7	0.2	0.31	0.06	0.37
0.9	0.2	0.07	0.01	0.09
Total	1.0	0.83	0.17	1.00

\(\P(\theta \geq 0.50) =\) 0.6 and \(\P(\theta | y \geq 0.50) =\) 0.83

Bayesian Workflow

Overview of the class

Syllabus

Bayesian Inference

Introductions

Two Truths and a Lie1

Two Truths and a Lie

Lecture 1: Bayesian Workflow

Statistics vs. AI/ML

Brief History

Laplace’s Demon

Modern Examples of Bayesian Analyses

Elasticity of Demand (Stan with Hierarchical Model)

Pharmacometrics (Stan with ODEs)

Nonparametric Bayes (Stan with GPs)

Review of Probability

Example

Random Variables Review

Example Simulation

Example Simulation with purrr

Review of Conditional Probability

Joint, Marginal, and Conditional

Joint, Marginal, and Conditional

Law of Total Probability (LOTP)

Bayes’s Rule for Events

Estimating \(\pi\) by Simulation

Estimating \(\pi\) by Simulation

Estimating \(\pi\) by Simulation

Example Bayesian Analysis

Notation

General Approach

Example Prior Model

Data Model

Likelihood Function

Likelihood Function

Compute the Posterior

Computing Event Probability

What If We Used a Flat Prior?

Bayesian Workflow

Overview of the class

Two Truths and a Lie¹