NYU Applied Statistics for Social Science Research

\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\P}{\mathbb{P}} \DeclareMathOperator{\V}{\mathbb{V}} \DeclareMathOperator{\L}{\mathscr{L}} \DeclareMathOperator{\I}{\text{I}} \DeclareMathOperator{\d}{\mathrm{d}} \]


Note
“Machine learning excels when you have lots of training data that can be reasonably modeled as exchangeable with your test data; Bayesian inference excels when your data are sparse and your model is dense.” — Andrew Gelman
Classify each of the following: {decision, parameter inference, prediction, causal inference}

Question
Which category is missing data imputation?
Summary of the book The Theory That Would Not Die

We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at any given moment knew all of the forces that animate nature and the mutual positions of the beings that compose it, if this intellect were vast enough to submit the data to analysis, could condense into a single formula the movement of the greatest bodies of the universe and that of the lightest atom; for such an intellect nothing could be uncertain, and the future just like the past would be present before its eyes.

Marquis Pierre Simon de Laplace (1749–1827)
“Uncertainty is a function of our ignorance, not a property of the world.”


Left: Pierre Simon Laplace in the style of Wassily Kandinsky, by OpenAI DALL·E
Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming, Riutort-Mayol et al. (2022)
Use the sample() function to simulate rolls of a die and the replicate() function to repeat the rolling process many times:

[1] 1 4
   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
X1    5    1    2    6    5    5    4    6    1     3
X2    1    3    3    4    2    3    2    6    5     4
[1]  6  4  5 10  7  8
[1] 0.9148

mean() computes the average. This is a more direct demonstration of random draws from the distribution of \(Y = X_1 + X_2\).
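The workflow above can be reconstructed with a short sketch (assumed code; the seed used in the original is unknown, so the exact draws will differ):

```r
# Sketch of the die-rolling simulation described above (assumed code;
# the original seed is unknown, so exact draws will differ).
set.seed(2025)
sample(1:6, size = 2, replace = TRUE)  # one pair of rolls (X1, X2)
rolls <- replicate(10, sample(1:6, size = 2, replace = TRUE))
rolls                                  # a 2 x 10 matrix, like X1/X2 above
colSums(rolls)                         # Y = X1 + X2 for each replicate
many_Y <- replicate(1e5, sum(sample(1:6, size = 2, replace = TRUE)))
mean(many_Y)                           # approximately E(Y) = 7
```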
Let \(A_1, \ldots, A_n\) be a partition of \(\Omega\), so that the \(A_i\) are disjoint, \(\P(A_i) > 0\), and \(\cup_{i=1}^{n} A_i = \Omega\). \[ \P(B) = \sum_{i=1}^{n} \P(B \cap A_i) = \sum_{i=1}^{n} \P(B \mid A_i) \P(A_i) \]
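As a quick numeric illustration (a toy example, not from the slides): partition the outcome of one fair die roll by parity and compute \(\P(B)\) for \(B = \{\text{roll} > 3\}\):

```r
# Law of total probability for one fair die roll (illustrative example).
# Partition: A1 = even = {2, 4, 6}, A2 = odd = {1, 3, 5}; event B = {4, 5, 6}.
p_A <- c(even = 1/2, odd = 1/2)
p_B_given_A <- c(even = 2/3, odd = 1/3)  # {4, 6} of 3 evens; {5} of 3 odds
p_B <- sum(p_B_given_A * p_A)
p_B  # 0.5, matching P({4, 5, 6}) computed directly
```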
Image from Blitzstein and Hwang (2019), Page 55
Suppose you forgot the law of iterated expectations (Adam’s Law) which states \[ \E(Y) = \E_X(\E_{Y|X}(Y \mid X)) \]
Use the following AI prompts
library(purrr)
n <- 20
class_sizes <- sample(n)
mean_gpa_vec <- runif(n = n, 2.8, 3.2)
N <- sum(class_sizes)
W <- class_sizes / N
G <- map2(class_sizes, mean_gpa_vec,
  \(x, m) rnorm(n = x, mean = m, sd = 1)) # each class draws around its own mean
# E(Y)
EY <- unlist(G) |> mean()
print(EY)
# E(E(Y | X))
class_means <- map_dbl(G, mean) # can't just average those
gpaXprob <- class_means * W
EEYX <- sum(gpaXprob)
print(EEYX)

\[ F_{X,Y}(x, y) \geq 0 \quad \text{for all } x, y \]
\[ F_{X,Y}(x, y) = \int_{-\infty}^x \int_{-\infty}^y f_{X,Y}(u, v) \, \mathrm{d}v \, \mathrm{d}u \]
\[ \int_{-\infty}^\infty \int_{-\infty}^\infty f_{X,Y}(x, y) \, \mathrm{d}x \, \mathrm{d}y = 1 \]
\[ f_{X,Y}(x, y) = \frac{\partial^2}{\partial x \partial y} F_{X,Y}(x, y) \]
\[ f_Y(y) = \int_{-\infty}^\infty f_{X,Y}(x, y) \, \mathrm{d}x. \]
\[ F_{Y \mid X}(y \mid x) = \P(Y \leq y \mid X = x) \]
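These identities can be checked numerically; a minimal sketch, using an independent standard-normal pair so the joint density factorizes:

```r
# Check that integrating the joint density over x recovers the marginal f_Y(y).
# Illustrative case: X and Y independent standard normals, so
# f_{X,Y}(x, y) = dnorm(x) * dnorm(y).
f_joint <- function(x, y) dnorm(x) * dnorm(y)
y0 <- 0.7
f_Y_numeric <- integrate(function(x) f_joint(x, y0),
                         lower = -Inf, upper = Inf)$value
c(numeric = f_Y_numeric, closed_form = dnorm(y0))  # the two values agree
```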

\[ f_{X,Y}(x, y) = \frac{1}{2 \pi \sqrt{\det(\Sigma)}} \exp\left( -\frac{1}{2} \begin{bmatrix} x - \mu_X \\ y - \mu_Y \end{bmatrix}^\top \Sigma^{-1} \begin{bmatrix} x - \mu_X \\ y - \mu_Y \end{bmatrix} \right) \]
\[ \Sigma = \begin{bmatrix} \sigma_X^2 & \rho\,\sigma_X\sigma_Y \\ \rho\,\sigma_X\sigma_Y & \sigma_Y^2 \end{bmatrix}, \quad \rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}, \quad -1 \leq \rho \leq 1 \]
library(distributional)
sigma_X <- 1
sigma_Y <- 2
rho <- 0.3 # Correlation coefficient
rho_sigmaXY <- rho * sigma_X * sigma_Y # why?
cov_matrix <- matrix(c(sigma_X^2, rho_sigmaXY,
                       rho_sigmaXY, sigma_Y^2), 2, 2)
dmv_norm <- dist_multivariate_normal(mu = list(c(0, 0)),
                                     sigma = list(cov_matrix))
dimnames(dmv_norm) <- c("x", "y")
variance(dmv_norm)

     x y
[1,] 1 4

covariance(dmv_norm)

[[1]]
       x   y
[1,] 1.0 0.6
[2,] 0.6 4.0
[1] 0
[1] 0.93
List of 1
$ : num [1:10000, 1:2] 0.956 1.422 0.175 0.597 -1.106 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:2] "x" "y"
# A tibble: 6 × 2
x y
<dbl> <dbl>
1 0.956 1.75
2 1.42 1.68
3 0.175 -1.70
4 0.597 2.71
5 -1.11 -2.10
6 -0.826 -1.95
x y
-0.01 0.00
x y
x 0.99 0.60
y 0.60 3.95
x y
x 1.0 0.3
y 0.3 1.0
[1] 0.93
[1] 0.8
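The draw summaries above come from simulated samples; here is a sketch of how they could be generated, assuming the dmv_norm object defined earlier (the generate() call and the 10,000-draw count are inferred from the output, not shown in the source):

```r
# Sketch: simulate 10,000 draws from the bivariate normal defined above
# and summarize them (the generating code was not shown in the source).
library(distributional)
set.seed(42)
dmv_norm <- dist_multivariate_normal(
  mu = list(c(0, 0)),
  sigma = list(matrix(c(1, 0.6, 0.6, 4), 2, 2)))
draws <- as.data.frame(generate(dmv_norm, times = 10000)[[1]])
names(draws) <- c("x", "y")
round(colMeans(draws), 2)          # near c(0, 0)
round(cov(draws), 2)               # near the Sigma used above
round(cor(draws), 2)               # off-diagonal near rho = 0.3
mean(draws$x < 0 | draws$y < 1)    # Monte Carlo estimate of P(X < 0 or Y < 1)
```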
How would we do the above analytically? \[ \P(X < 0 \cup Y < 1) = \P(X < 0) + \P(Y < 1) - \P(X < 0 \cap Y < 1) \]
# Compute P(X < 0)
p_X_less_0 <- cdf(dist_normal(mu = 0, sigma = sigma_X), 0) # pnorm(0, mean = 0, sd = sigma_X)
# Compute P(Y < 1)
p_Y_less_1 <- cdf(dist_normal(mu = 0, sigma = sigma_Y), 1) # pnorm(1, mean = 0, sd = sigma_Y)
# Compute P(X < 0, Y < 1)
p_X_less_0_and_Y_less_1 <- cdf(dmv_norm, cbind(0, 1))
# P(X < 0 or Y < 1)
(p_X_less_0 + p_Y_less_1 - p_X_less_0_and_Y_less_1) |> round(2)

[1] 0.8
Which do you prefer: the Monte Carlo estimate mean(draws$x < 0 | draws$y < 1) or the analytical solution?

\[ \P(A \mid B) = \frac{\P(A \cap B)}{\P(B)} = \frac{\P(B \mid A) \P(A)}{\sum_{i=1}^{n} \P(B \mid A_i) \P(A_i)} \]
\[ \underbrace{\color{blue}{f_{\Theta|Y}(\theta \mid y)}}_{\text{Posterior}} = \frac{ \overbrace{ \color{red}{f_{Y|\Theta}(y \mid \theta)} }^{\text{Likelihood}} \cdot \overbrace{\color{green}{f_\Theta(\theta)}}^{ \text{Prior} } }{ \underbrace{\int_{-\infty}^{\infty} f_{Y|\Theta}(y \mid \theta) f_\Theta(\theta) \, \text{d}\theta}_{\text{Prior Predictive } f(y)} } \propto \color{red}{f_{Y|\Theta}(y \mid \theta)} \cdot \color{green}{f_\Theta(\theta)} \]
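As a concrete discrete instance of Bayes' rule (a standard textbook diagnostic-testing example with hypothetical numbers, not from the slides):

```r
# Bayes' rule with hypothetical numbers: prevalence 1%,
# test sensitivity 95%, false-positive rate 5%.
prior   <- c(disease = 0.01, healthy = 0.99)
lik_pos <- c(disease = 0.95, healthy = 0.05)          # P(test + | state)
posterior <- (lik_pos * prior) / sum(lik_pos * prior) # denominator = P(test +)
round(posterior, 3)  # P(disease | test +) is only about 0.161
```

The denominator is exactly the law-of-total-probability sum in the formula above, computed over the two-element partition {disease, healthy}.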
For a more complete workflow, see Bayesian Workflow by Gelman et al. (2020)