Bayesian Inference

NYU Applied Statistics for Social Science Research

Eric Novik | Spring 2023 | Lecture 1

Introductions

\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\P}{\mathbb{P}} \DeclareMathOperator{\V}{\mathbb{V}} \DeclareMathOperator{\L}{\mathscr{L}} \DeclareMathOperator{\I}{\text{I}} \]

Two Truths and a Lie

  1. One person tells three personal statements, one of which is a lie.
  2. Others discuss and guess which statement is the lie, and they jointly construct a numerical statement of their certainty in the guess (on a 0–10 scale).
  3. The storyteller reveals which was the lie.
  4. Enter the certainty number and the outcome (success or failure) and submit in the Google form. Rotate through everyone in your group so that each person plays the storyteller role once.

Two Truths and a Lie

https://tinyurl.com/two-truths-and

Lecture 1: Bayesian Workflow

  • Two Truths and a Lie
  • Statistics vs AI/ML
  • Brief history of Bayesian inference
  • Review of basic probability
  • Introduction to Bayesian workflow
  • Bayes’s rule for events
  • Binomial model and the Bayesian Crank
  • Overview of the Course

Statistics vs. AI/ML

⚠️ What follows is an oversimplified opinion.

  • AI is great for automating tasks that humans find easy
    • Recognizing faces, cats, and other objects
    • Identifying tumors on a radiology scan
    • Playing Chess and Go
    • Driving a car
    • Post ChatGPT 4: Generating coherent text/images/video, signs of reasoning
  • Statistics is great at answering questions that humans find hard
    • How fast does a drug clear from the body?
    • What is the expected tumor size two months after treatment?
    • How would patients respond under a different treatment?
    • Should I take this drug?
    • Should we (FDA/EMA/…) approve this drug?

Brief History

Summary of the book The Theory That Would Not Die

  • Thomas Bayes (1702(?) — 1761) is credited with the discovery of what is now called Bayes’s rule
  • His paper was published posthumously by Richard Price in 1763
  • Laplace (1749 — 1827) independently discovered the rule and published it in 1774
  • Scientific context: Newton’s Principia was published in 1687
  • Bayesian wins: German Enigma cipher, search for a missing H-bomb, Federalist papers, Moneyball, FiveThirtyEight

Laplace’s Demon

We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at any given moment knew all of the forces that animate nature and the mutual positions of the beings that compose it, if this intellect were vast enough to submit the data to analysis, could condense into a single formula the movement of the greatest bodies of the universe and that of the lightest atom; for such an intellect nothing could be uncertain, and the future just like the past would be present before its eyes.

Marquis Pierre Simon de Laplace (1749 — 1827)

“Uncertainty is a function of our ignorance, not a property of the world.”

Modern Examples of Bayesian Analyses

Vote Share Analysis (Hierarchical Models)

Pharmacometrics (ODE Based Models)

Medical Decision Making (Bayesian Nonparametrics)

Review of Probability

  • A set of all possible outcomes is called a sample space and denoted by \(\Omega\)
  • An outcome of an experiment is denoted by \(\omega \in \Omega\)
  • We typically denote events by capital letters, say \(A \subseteq \Omega\)
  • Axioms of probability:
    • \(\P(A) \geq 0, \, \text{for all } A\)
    • \(\P(\Omega) = 1\)
    • If \(A_1, A_2, A_3, \ldots\) are disjoint: \(\P(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} \P(A_i)\)

Example

  • You roll a fair six-sided die twice
  • Give an example of an \(\omega \in \Omega\)
  • How many elements are in \(\Omega\)? What is \(\Omega\)?
  • Define an event \(A\): the sum of the two rolls is less than 11
  • How many elements are in \(A\)?
  • What is \(\P(A)\)?
outer(1:6, 1:6, FUN = "+")
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    2    3    4    5    6    7
[2,]    3    4    5    6    7    8
[3,]    4    5    6    7    8    9
[4,]    5    6    7    8    9   10
[5,]    6    7    8    9   10   11
[6,]    7    8    9   10   11   12
  • \(\Omega\) consists of all pairs:
    • \(\{(1, 1), (1, 2), ... (2, 1), (2, 2), ...\}\)
  • There are 36 such pairs
  • 33 of those pairs result in a sum of less than 11
  • \(\P(A) = \frac{33}{36} = \frac{11}{12}\)
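A quick numerical check of this count, a minimal sketch reusing the outer() grid above:

s <- outer(1:6, 1:6, FUN = "+")  # all 36 possible sums of two rolls
sum(s < 11)                      # 33 pairs with a sum less than 11
mean(s < 11)                     # 33/36 = 11/12, about 0.917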

Random Variables Review

  • A random variable is not random – it is a deterministic mapping from the sample space onto the real line; the randomness comes from the experiment
  • PMF, PDF, CDF (Blitzstein and Hwang, Ch. 3, 5)
  • Expectations (Blitzstein and Hwang, Ch. 4)
  • Joint Distributions (Blitzstein and Hwang, Ch. 7)
  • Conditional Expectations (Blitzstein and Hwang, Ch. 9)

Random variable X for the number of Heads in two flips

Example Simulation

  • We will use R’s sample() function to simulate rolls of a die and replicate() function to repeat the rolling process many times
  • Modern approach: purrr::map(1:n, \(x) expression)
die <- 1:6
roll <- function(x, n) {
  sample(x, size = n, replace = TRUE)
}
roll(die, 2) # roll the die twice
[1] 6 3
rolls <- replicate(1e4, roll(die, 2))
rolls[, 1:8] # print first 8 columns
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    6    5    6    6    2    3    6    3
[2,]    3    2    4    4    3    2    6    1
roll_sums <- colSums(rolls)
head(roll_sums)
[1]  9  7 10 10  5  5
mean(roll_sums < 11) 
[1] 0.9151
  • Given a random variable \(Y\), \(y^{(1)}, y^{(2)}, y^{(3)}, \ldots, y^{(N)}\) are simulations or draws from \(Y\)
  • Fundamental bridge (Blitzstein & Hwang p. 164): \(\P(A) = \E(\I_A)\)
  • Computationally: \(\P(Y < 11) \approx \frac{1}{N} \sum^{N}_{n=1} \I(y^{(n)} < 11)\)
  • In R, roll_sums < 11 creates an indicator variable
  • And mean() does the average

Example Simulation with purrr

  • Roll a die twice, twenty times
sim <- purrr::map(1:20, \(x) roll(die, 2))
str(sim)
List of 20
 $ : int [1:2] 1 3
 $ : int [1:2] 6 5
 $ : int [1:2] 2 5
 $ : int [1:2] 5 5
 $ : int [1:2] 1 6
 $ : int [1:2] 6 3
 $ : int [1:2] 6 4
 $ : int [1:2] 5 5
 $ : int [1:2] 4 1
 $ : int [1:2] 5 4
 $ : int [1:2] 6 5
 $ : int [1:2] 2 5
 $ : int [1:2] 4 6
 $ : int [1:2] 2 6
 $ : int [1:2] 6 5
 $ : int [1:2] 4 5
 $ : int [1:2] 6 6
 $ : int [1:2] 5 6
 $ : int [1:2] 3 3
 $ : int [1:2] 5 2
  • Roll a die once, twice, …, twenty times
sim <- purrr::map(1:20, \(x) roll(die, x))
str(sim)
List of 20
 $ : int 2
 $ : int [1:2] 4 3
 $ : int [1:3] 5 1 1
 $ : int [1:4] 5 6 2 6
 $ : int [1:5] 2 2 4 1 1
 $ : int [1:6] 1 6 6 5 4 4
 $ : int [1:7] 5 3 2 2 4 4 3
 $ : int [1:8] 2 6 3 5 3 1 1 1
 $ : int [1:9] 2 1 4 1 5 2 6 4 6
 $ : int [1:10] 3 6 4 5 4 4 6 3 3 6
 $ : int [1:11] 4 1 4 1 5 4 6 3 6 5 ...
 $ : int [1:12] 5 2 5 4 2 6 3 4 1 4 ...
 $ : int [1:13] 6 5 5 5 4 3 2 5 3 6 ...
 $ : int [1:14] 3 5 3 5 1 3 6 6 2 4 ...
 $ : int [1:15] 4 6 6 5 1 2 5 5 3 2 ...
 $ : int [1:16] 3 2 6 4 2 4 2 2 6 6 ...
 $ : int [1:17] 5 2 5 2 5 1 3 4 4 6 ...
 $ : int [1:18] 3 6 1 4 2 4 5 4 2 4 ...
 $ : int [1:19] 5 5 6 3 3 5 4 1 5 4 ...
 $ : int [1:20] 6 2 4 2 3 3 3 1 4 6 ...

Review of Conditional Probability

  • For arbitrary \(A\) and \(B\), if \(\P(B) > 0\):
    • Conditional probability: \(\P(A \mid B) = \frac{\P(AB)}{\P(B)}\)
    • Conditional probability: \(\P(B \mid A) = \frac{\P(AB)}{\P(A)}\)
  • Bayes’s rule: \(\P(A \mid B) = \frac{\P(AB)}{\P(B)} = \frac{\P(B \mid A) \P(A)}{\P(B)}\)
  • \(A\) and \(B\) are independent if observing \(B\) does not give you any more information about \(A\): \(\P(A \mid B) = \P(A)\)
  • And \(\P(A B) = \P(A) \P(B)\) (from Bayes’s rule)
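As a quick sanity check of the independence definition, here is a minimal sketch that enumerates two fair die rolls and compares \(\P(AB)\) with \(\P(A)\P(B)\) for two illustrative events (not from the slides) that depend on different rolls:

grid <- expand.grid(first = 1:6, second = 1:6)  # all 36 equally likely outcomes
A <- grid$first == 6                            # A: first roll is a six
B <- grid$second %% 2 == 0                      # B: second roll is even
mean(A & B)                                     # P(AB) = 1/12
mean(A) * mean(B)                               # P(A) P(B) = 1/6 * 1/2 = 1/12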

Joint, Marginal, and Conditional

  • The Titanic carried approximately 2,200 passengers and crew and sank on 15 April 1912
  • Let \(G: \{m, f\}\) represent Gender and \(S: \{n, y\}\) represent Survival
# assumes dplyr/tibble and janitor (for adorn_totals) are loaded
surv <- apply(Titanic, c(2, 4), sum) |> 
  as_tibble()
surv_prop <- round(surv / sum(surv), 3)
bind_cols(Sex = c("Male", "Female"), 
          surv_prop) |> 
  adorn_totals(c("row", "col")) |>
  knitr::kable(caption = 
               "Titanic survival proportions")
Titanic survival proportions

Sex      No     Yes    Total
Male     0.620  0.167  0.787
Female   0.057  0.156  0.213
Total    0.677  0.323  1.000
  • \(\P(G = m \cap S = y) =\) 0.167
  • \(\P(S = y) = \sum_{i \in \{m, f\}} \P(S = y \, \cap G = i) =\) ??
  • \(\P(S = y) = \sum_{i \in \{m, f\}} \P(S = y \, \cap G = i) = 0.323\)

Joint, Marginal, and Conditional

  • What is \(\P(G = m \mid S = y)\), probability of being male given survival?
  • To compute that, we only consider the column where Survival = Yes
  • \(\P(G = m \mid S = y) = \frac{\P(G = m \, \cap \, S = y)}{\P(S = y)} = \frac{0.167}{0.323} \approx 0.52\) (these ratios are verified in the short R check after the table)
  • Now we want \(\P(S = y \mid G = m)\), to compare it with \(\P(S = y \mid G = f)\)
  • \(\P(S = y \mid G = m) = \frac{\P(G = m \, \cap \, S = y)}{\P(G = m)} = \frac{0.167}{0.787} \approx 0.21\)
  • \(\P(S = y \mid G = f) = \frac{\P(G = f \, \cap \, S = y)}{\P(G = f)} = \frac{0.156}{0.213} \approx 0.73\)
  • How would you calculate \(\P(S = n \mid G = m)\)?
  • \(\P(S = n \mid G = m) = 1 - \P(S = y \mid G = m)\)
Titanic survival proportions

Sex      No     Yes    Total
Male     0.620  0.167  0.787
Female   0.057  0.156  0.213
Total    0.677  0.323  1.000
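A minimal R check of these conditional probabilities; the joint matrix below is just the table re-entered by hand:

joint <- matrix(c(0.620, 0.057, 0.167, 0.156), nrow = 2,
                dimnames = list(Sex = c("Male", "Female"),
                                Survived = c("No", "Yes")))
joint["Male", "Yes"]   / sum(joint[, "Yes"])     # P(G = m | S = y), ~0.52
joint["Male", "Yes"]   / sum(joint["Male", ])    # P(S = y | G = m), ~0.21
joint["Female", "Yes"] / sum(joint["Female", ])  # P(S = y | G = f), ~0.73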

“Untergang der Titanic”, as conceived by Willy Stöwer, 1912

Law of Total Probability (LOTP)

Let \(A_1, A_2, \ldots, A_n\) be a partition of \(\Omega\), so that the \(A_i\) are disjoint, \(\P(A_i) > 0\), and \(\cup_{i=1}^{n} A_i = \Omega\). \[ \P(B) = \sum_{i=1}^{n} \P(B \cap A_i) = \sum_{i=1}^{n} \P(B \mid A_i) \P(A_i) \]
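A quick numerical check of the LOTP, a minimal sketch using the Titanic proportions from the previous slide, with \(B = \{S = y\}\) and the partition \(\{G = m\}, \{G = f\}\):

p_y_given_m <- 0.167 / 0.787               # P(S = y | G = m)
p_y_given_f <- 0.156 / 0.213               # P(S = y | G = f)
p_y_given_m * 0.787 + p_y_given_f * 0.213  # P(S = y) = 0.323, matching the marginal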

Bayes’s Rule for Events

  • We can combine the definition of conditional probability with the LOTP to come up with Bayes’s rule for events, assuming \(\P(B) \neq 0\)

\[ \P(A \mid B) = \frac{\P(B \cap A)}{\P(B)} = \frac{\P(B \mid A) \P(A)}{\sum_{i=1}^{n} \P(B \mid A_i) \P(A_i)} \]

  • We typically think of \(A\) as some unknown we wish to learn (e.g., the status of a disease) and \(B\) as the data we observe (e.g., the result of a diagnostic test)

  • We call \(\P(A)\) the prior probability of \(A\) (e.g., how prevalent the disease is in the population)

  • We call \(\P(A \mid B)\) the posterior probability of the unknown \(A\) given data \(B\)

Example: Medical Testing

The authors calculated the sensitivity and specificity of the Abbott PanBio SARS-CoV-2 rapid antigen test to be 45.4% and 99.8%, respectively. Suppose the prevalence is 0.1%.

  • Your child tests positive on this test. What is the probability that she has COVID? That is, we want to know \(\P(D^+ \mid T^+)\)
  • \(\text{Specificity } := \P(T^- \mid D^-) = 0.998\)
  • False positive rate \(\text{FP} := 1 - \text{Specificity } = 1 - \P(T^- \mid D^-) = \P(T^+ \mid D^-) = 0.002\)
  • \(\text{Sensitivity } := \P(T^+ \mid D^+) = 0.454\)
  • False negative rate \(\text{FN} := 1 - \text{Sensitivity } = 1 - \P(T^+ \mid D^+) = \P(T^- \mid D^+) = 0.546\)
  • Prevalence: \(\P(D^+) = 0.001\)

\[ \begin{aligned} \P(D^+ \mid T^+) = \frac{\P(T^+ \mid D^+) \P(D^+)}{\P(T^+)} &= \frac{\P(T^+ \mid D^+) \P(D^+)}{\sum_{i \in \{+,-\}} \P(T^+ \mid D^i) \P(D^i)} \\ &= \frac{\P(T^+ \mid D^+) \P(D^+)}{\P(T^+ \mid D^+) \P(D^+) + \P(T^+ \mid D^-) \P(D^-)} \\ &= \frac{0.454 \cdot 0.001}{0.454 \cdot 0.001 + 0.002 \cdot 0.999} \approx 0.18 \end{aligned} \]
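A minimal R check of the arithmetic above (the variable names are illustrative):

sens <- 0.454; spec <- 0.998; prev <- 0.001
p_pos <- sens * prev + (1 - spec) * (1 - prev)  # P(T+) by the LOTP
sens * prev / p_pos                             # P(D+ | T+), approximately 0.185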

Bayesian Analysis

  • Suppose we enrolled five people in an early cancer clinical trial
  • Each person was given an active treatment
  • From previous trials, we have some idea of historical response rates (proportion of people responding)
  • At the end of the trial, \(Y = y \in \{0,1,...,5 \}\) people will have responded to the treatment
  • We are interested in estimating the probability that the response rate is greater than or equal to 50%

Image from Fokko Smits, Martijn Dirksen, and Ivo Schoots: RECIST 1.1 - and more

Notation

Greek letters will be used for latent parameters, and English letters will be used for observables.

  • \(\theta\): unknowns or parameters to be estimated; could be multivariate, discrete, or continuous (your book uses \(\pi\))
  • \(y\): observations or measurements to be modelled (\(y_1, y_2, ...\))
  • \(\widetilde{y}\) : unobserved but observable quantities (in your book \(y'\))
  • \(x\): covariates
  • \(f( \theta )\): a prior model, P[DM]F of \(\theta\)
  • \(f_y(y \mid \theta, x)\): an observational model, P[DM]F when it is a function of \(y\) (in your book: \(f(y \mid \pi)\)); we typically drop the \(x\) to simplify the notation.
  • \(f_{\theta}(y \mid \theta)\): the likelihood function, when viewed as a function of \(\theta\) (in your book: \(\L(\pi \mid y)\))
  • Some people write \(\L(\theta; y)\) or simply \(\L(\theta)\)

General Approach

  • Before observing the data, we need to specify a prior model \(f(\theta)\) on all unknowns \(\theta\)
  • Pick a data model \(f(y \mid \theta, x)\) — this is typically more important than the prior; this includes the model for the conditional mean: \(\E(y \mid x)\)
  • For more complex models, we construct a prior predictive distribution, \(f(y)\)
    • We will define this quantity later — it will help us assess if our choice of priors makes sense on the observational scale
  • After we observe data \(y\), we treat \(f(y \mid \theta, x)\) as the likelihood of observing \(y\) under all plausible values of \(\theta\), conditioning on \(x\) if necessary
  • Derive a posterior model for \(\theta\), \(\, f(\theta \mid y, x)\) using Bayes’s rule or by simulation
  • Evaluate model quality: 1) quality of the inferences; 2) quality of predictions. Revise the model if necessary
  • Compute all the quantities of interest from the posterior, such as event probabilities, e.g., \(\P(\theta > 0.5)\), posterior predictive distribution \(f(\widetilde{y} \mid y)\), decision functions, etc.

Example Prior Model

  • We will construct a prior model for our clinical trial
  • From previous trials, we construct a discretized version of the prior distribution of the response rate
  • The most likely value for response rate is 30%
# assumes ggplot2 is loaded
dot_plot <- function(x, y) {
  p <- ggplot(data.frame(x, y), aes(x, y))
  p + geom_point(aes(x = x, y = y), size = 0.5) +
    geom_segment(aes(x = x, y = 0, xend = x, 
                     yend = y), linewidth = 0.2) +
    xlab(expression(theta)) + 
    ylab(expression(f(theta)))
}
theta <- c(0.10, 0.30, 0.50, 0.70, 0.90)
prior <- c(0.05, 0.45, 0.30, 0.15, 0.05)
dot_plot(theta, prior) +
  ggtitle("Prior probability of response")

  • Even though \(\theta\) is a continuous parameter, we can still specify a discrete prior
  • The posterior will also be discrete and of the same cardinality

Data Model

  • We consider each person’s response to be independent, given the treatment
  • We have a fixed number of people in the trial, \(N = 5\), and \(0\) to \(5\) successes
  • We will therefore consider: \(y | \theta \sim \text{Bin}(N,\theta) = \text{Bin}(5,\theta)\)
  • \(f(y \mid \theta) = \text{Bin} (y \mid 5,\theta) = \binom{5}{y} \theta^y (1 - \theta)^{5 - y}\) for \(y \in \{0,1,\ldots,5\}\)
  • Is this a valid probability distribution as a function of \(y\)?
N <- 5; y <- 0:N; theta <- 0.5
(f_y <- dbinom(x = y, size = N, prob = theta) |>
    fractions())  # fractions() comes from the MASS package
[1] 1/32 5/32 5/16 5/16 5/32 1/32
sum(dbinom(x = y, size = N, prob = theta))
[1] 1

Likelihood Function

  • We ran the trial and observed 3 out of 5 responders
  • We can now construct a likelihood function for \(y = 3\) as a function of \(\theta\)
  • Let’s check whether this function is a probability distribution in \(\theta\): \[ f (y) = \int_{0}^{1} \binom{N}{y} \theta^y (1 - \theta)^{N - y}\, d\theta = \frac{1}{N + 1} \]
  • The integral is \(\frac{1}{N+1}\), not 1, so as a function of \(\theta\) the likelihood is not a probability distribution; it does, however, define the marginal \(f(y)\) under a flat prior on \(\theta\)
dbinom_theta <- function(theta, N, y) {
  choose(N, y) * theta^y * (1 - theta)^(N - y) 
}
integrate(dbinom_theta, lower = 0, upper = 1, 
          N = 5, y = 3)[[1]] |> fractions()
[1] 1/6
  • \(f(y)\) is called the marginal distribution of the data or, more aptly, the prior predictive distribution
  • It tells us that, prior to observing \(y\) (and under a flat prior on \(\theta\)), all values of \(y\) are equally likely
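The same prior predictive can be computed under the discrete five-point prior from the earlier slide instead of a flat prior; a minimal sketch of \(f(y) = \sum_{\theta} f(y \mid \theta)\, f(\theta)\):

theta <- c(0.10, 0.30, 0.50, 0.70, 0.90)
prior <- c(0.05, 0.45, 0.30, 0.15, 0.05)
# prior predictive: for each y, average the binomial PMF over the prior on theta
f_y <- sapply(0:5, \(k) sum(dbinom(k, size = 5, prob = theta) * prior))
round(f_y, 2)  # no longer uniform over y
sum(f_y)       # still sums to 1

The entry for \(y = 3\) is the normalizing constant \(\approx 0.2\) that appears in the posterior computation on the next slide.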

Likelihood Function

Compare with the data model:

Compute the Posterior

  • Compute the likelihood
    • \(\binom{5}{3} \theta^3 (1 - \theta)^{2}\) for \(\theta \in \{0.10, 0.30, 0.50, 0.70, 0.90\}\)
  • Multiply the likelihood by the prior to compute the numerator
    • \(f(y = 3 \mid \theta) f(\theta)\)
  • Sum the numerator to get the marginal likelihood
    • \(f(y = 3) = \sum_{\theta} f(y = 3 | \theta) f(\theta) \approx 0.2\)
  • Finally, compute the posterior
    • \(f(\theta \mid y = 3) = \frac{f(y = 3 \mid \theta) f(\theta)}{\sum_{\theta} f(y = 3 | \theta) f(\theta)}\)
N <- 5; y <- 3
theta <- c(0.10, 0.30, 0.50, 0.70, 0.90)
prior <- c(0.05, 0.45, 0.30, 0.15, 0.05)
lik <- dbinom(y, N, theta)
lik_x_prior <-  lik * prior
constant <- sum(lik_x_prior)
post <- lik_x_prior / constant


theta  prior  lik   lik_x_prior  post
0.1    0.05   0.01  0.00         0.00
0.3    0.45   0.13  0.06         0.29
0.5    0.30   0.31  0.09         0.46
0.7    0.15   0.31  0.05         0.23
0.9    0.05   0.07  0.00         0.02
Total  1.00   0.83  0.20         1.00

Computing Event Probability

  • To compute event probabilities, we integrate (or sum) the relevant regions of the parameter space \[ \P(\theta \geq 0.5) = \int_{0.5}^{1} f(\theta \mid y) \, d\theta \]

  • In this case, we only have discrete quantities, so we sum:

# assumes dplyr is loaded; d is the table above, assembled as a data frame
d <- data.frame(theta, prior, lik, lik_x_prior, post)
probs <- d |>
  filter(theta >= 0.50) |>
  dplyr::select(prior, post) |>
  colSums() |>
  round(2)
probs
prior  post 
 0.50  0.71 
theta  prior  lik   lik_x_prior  post
0.1    0.05   0.01  0.00         0.00
0.3    0.45   0.13  0.06         0.29
0.5    0.30   0.31  0.09         0.46
0.7    0.15   0.31  0.05         0.23
0.9    0.05   0.07  0.00         0.02
Total  1.00   0.83  0.20         1.00


  • \(\P(\theta \geq 0.50) =\) 0.5 (prior) and \(\P(\theta \geq 0.50 \mid y) =\) 0.71 (posterior)

What If We Used a Flat Prior?

  • A flat or uniform prior means that we consider all values of \(\theta\) equally likely
N <- 5; y <- 3
theta <- c(0.10, 0.30, 0.50, 0.70, 0.90)
prior <- c(0.20, 0.20, 0.20, 0.20, 0.20)
lik <- dbinom(y, N, theta)
lik_x_prior <-  lik * prior
constant <- sum(lik_x_prior)
post <- lik_x_prior / constant
theta  prior  lik   lik_x_prior  post
0.1    0.2    0.01  0.00         0.01
0.3    0.2    0.13  0.03         0.16
0.5    0.2    0.31  0.06         0.37
0.7    0.2    0.31  0.06         0.37
0.9    0.2    0.07  0.01         0.09
Total  1.0    0.83  0.17         1.00
  • \(\P(\theta \geq 0.50) =\) 0.6 (prior) and \(\P(\theta \geq 0.50 \mid y) =\) 0.83 (posterior)

Bayesian Workflow

Overview of the class