All possible sums of two dice (rows index the first die, columns the second):

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    2    3    4    5    6    7
[2,]    3    4    5    6    7    8
[3,]    4    5    6    7    8    9
[4,]    5    6    7    8    9   10
[5,]    6    7    8    9   10   11
[6,]    7    8    9   10   11   12
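This table can be reproduced with base R's outer() (an assumption about how it was generated; the original call is not shown):

outer(1:6, 1:6, "+")   # all pairwise sums of the faces of two dice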
NYU Applied Statistics for Social Science Research
\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\P}{\mathbb{P}} \DeclareMathOperator{\V}{\mathbb{V}} \DeclareMathOperator{\L}{\mathscr{L}} \DeclareMathOperator{\I}{\text{I}} \DeclareMathOperator{\d}{\mathrm{d}} \]
https://tinyurl.com/two-truths-and
Note
“Machine learning excels when you have lots of training data that can be reasonably modeled as exchangeable with your test data; Bayesian inference excels when your data are sparse and your model is dense.” — Andrew Gelman
Classify each of the following: {decision, parameter inference, prediction, causal inference}
Question
Which category is missing data imputation?
Summary of the book The Theory That Would Not Die
We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at any given moment knew all of the forces that animate nature and the mutual positions of the beings that compose it, if this intellect were vast enough to submit the data to analysis, could condense into a single formula the movement of the greatest bodies of the universe and that of the lightest atom; for such an intellect nothing could be uncertain, and the future just like the past would be present before its eyes.
Marquis Pierre Simon de Laplace (1749–1827)
“Uncertainty is a function of our ignorance, not a property of the world.”
Left: Pierre Simon Laplace in the style of Wassily Kandinsky, by OpenAI DALL·E
Timonen, J., Mannerström, H., Vehtari, A., & Lähdesmäki, H. (2021)
\[ \P_X(X = x_i) = \P\left(\{\omega_j \in \Omega : X(\omega_j) = x_i\}\right) \]
Random variable \(X(\omega)\) for the number of Heads in two flips
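A minimal R sketch of this pushforward for two coin flips: enumerate the sample space and tabulate the value of \(X\) for each outcome (object names here are illustrative):

omega <- expand.grid(flip1 = c("H", "T"), flip2 = c("H", "T"))  # sample space of two flips
X <- rowSums(omega == "H")                                      # X(omega) = number of Heads
table(X) / nrow(omega)                                          # P(X = 0), P(X = 1), P(X = 2)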
We use the sample() function to simulate rolls of a die and the replicate() function to repeat the rolling process many times.

die <- 1:6   # the six faces of a fair die
N <- 1e4     # number of simulated experiments
roll <- function(x, n) {
sample(x, size = n, replace = TRUE)
}
roll(die, 2) # roll the die twice
[1] 3 2
rolls <- replicate(N, roll(die, 2))
rownames(rolls) <- c("X1", "X2")
rolls[, 1:10] # print first 10 outcomes
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
X1 2 6 2 4 6 1 2 1 1 4
X2 4 6 4 4 2 4 4 6 4 5
The sums Y = X1 + X2 for the first six experiments:
[1]  6 12  6  8  8  5
[1] 0.915
mean() computes the average of the indicator, i.e., the proportion of simulations in which the event occurred.
Warning
There is something subtle going on: there is a difference between generating 100 draws from one random variable \(Y\) versus getting one draw from each of 100 \(Y\)s.
This is a more direct demonstration of random draws from the distribution of \(Y = X_1 + X_2\):
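A minimal sketch of such direct draws, reusing die, N, and roll() from above:

y_draws <- replicate(N, sum(roll(die, 2)))   # one draw of Y = X1 + X2 per experiment
mean(y_draws)                                # should be close to E[Y] = 7
table(y_draws) / N                           # empirical distribution of Y over 2, ..., 12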
\[ F_{X,Y}(x, y) \geq 0 \quad \text{for all } x, y \]
\[ F_{X,Y}(x, y) = \int_{-\infty}^x \int_{-\infty}^y f_{X,Y}(u, v) \, \mathrm{d}v \, \mathrm{d}u \]
\[ \int_{-\infty}^\infty \int_{-\infty}^\infty f_{X,Y}(x, y) \, \mathrm{d}x \, \mathrm{d}y = 1 \]
\[ f_{X,Y}(x, y) = \frac{\partial^2}{\partial x \partial y} F_{X,Y}(x, y) \]
\[ f_Y(y) = \int_{-\infty}^\infty f_{X,Y}(x, y) \, \mathrm{d}x. \]
\[ F_{Y \mid X}(y \mid x) = \P(Y \leq y \mid X = x) \]
\[ f_{X,Y}(x, y) = \frac{1}{2 \pi \sqrt{\det(\Sigma)}} \exp\left( -\frac{1}{2} \begin{bmatrix} x - \mu_X \\ y - \mu_Y \end{bmatrix}^\top \Sigma^{-1} \begin{bmatrix} x - \mu_X \\ y - \mu_Y \end{bmatrix} \right) \]
library(distributional)
sigma_X <- 1
sigma_Y <- 2
rho <- 0.3 # Correlation coefficient
rho_sigmaXY <- rho * sigma_X * sigma_Y # why?
cov_matrix <- matrix(c(sigma_X^2, rho_sigmaXY,
rho_sigmaXY, sigma_Y^2), 2, 2)
dmv_norm <- dist_multivariate_normal(mu = list(c(0, 0)),
sigma = list(cov_matrix))
dimnames(dmv_norm) <- c("x", "y")
variance(dmv_norm)
x y
[1,] 1 4
The full covariance matrix of the distribution:
[[1]]
       x   y
[1,] 1.0 0.6
[2,] 0.6 4.0
[1] 0
[1] 0.93
Structure of 1,000 random draws from the bivariate normal (columns x and y):
List of 1
 $ : num [1:1000, 1:2] 0.0231 2.2418 0.9219 0.0766 -1.0797 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "x" "y"
# A tibble: 6 × 2
x y
<dbl> <dbl>
1 0.0231 1.74
2 2.24 2.25
3 0.922 0.556
4 0.0766 -3.37
5 -1.08 -0.600
6 0.0373 -1.41
Sample means of the draws:
    x     y
-0.01 -0.09

Sample covariance matrix:
     x    y
x 1.02 0.60
y 0.60 4.26

Sample correlation matrix:
     x    y
x 1.00 0.29
y 0.29 1.00
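These summaries can be computed with base R (a sketch, assuming draws is the tibble of simulated x, y pairs shown above):

round(colMeans(draws), 2)   # sample means
round(cov(draws), 2)        # sample covariance matrix
round(cor(draws), 2)        # sample correlation matrix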
# should be close to cdf(dmv_norm, cbind(3, 3)); why?
# both evaluate to P(X < 3 and Y < 3)
mean(draws$x < 3 & draws$y < 3)
[1] 0.924
[1] 0.836
How would we do the above analytically? \[ \P(X < 0 \cup Y < 1) = \P(X < 0) + \P(Y < 1) - \P(X < 0 \cap Y < 1) \]
# Compute P(X < 0)
p_X_less_0 <- cdf(dist_normal(mu = 0, sigma = sigma_X), 0) # pnorm(0, mean = 0, sd = sigma_X)
# Compute P(Y < 1)
p_Y_less_1 <- cdf(dist_normal(mu = 0, sigma = sigma_Y), 1) # pnorm(1, mean = 0, sd = sigma_Y)
# Compute P(X < 0, Y < 1)
p_X_less_0_and_Y_less_1 <- cdf(dmv_norm, cbind(0, 1))
# P(X < 0 or Y < 1)
(p_X_less_0 + p_Y_less_1 - p_X_less_0_and_Y_less_1) |> round(2)
[1] 0.8
mean(draws$x < 0 | draws$y < 1)
Which do you trust more: the simulation estimate or the analytical solution?

Joint probabilities of sex and survival aboard the Titanic:

Sex | No | Yes | Total |
---|---|---|---|
Male | 0.620 | 0.167 | 0.787 |
Female | 0.057 | 0.156 | 0.213 |
Total | 0.677 | 0.323 | 1.000 |
“Untergang der Titanic” (“The Sinking of the Titanic”), as conceived by Willy Stöwer, 1912
Let \(A_1, \ldots, A_n\) be a partition of \(\Omega\), so that the \(A_i\) are pairwise disjoint, \(\P(A_i) > 0\), and \(\cup_{i=1}^{n} A_i = \Omega\). \[ \P(B) = \sum_{i=1}^{n} \P(B \cap A_i) = \sum_{i=1}^{n} \P(B \mid A_i) \P(A_i) \]
Image from Blitzstein and Hwang (2019), Page 55
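As a concrete check, apply this to the Titanic table above with the partition {Male, Female} and \(B\) the event of surviving (the Yes column):

\[ \P(\text{Yes}) = \P(\text{Yes} \cap \text{Male}) + \P(\text{Yes} \cap \text{Female}) = 0.167 + 0.156 = 0.323, \]

which matches the column total.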
\[ \P(A \mid B) = \frac{\P(A \cap B)}{\P(B)} = \frac{\P(B \mid A) \P(A)}{\sum_{i=1}^{n} \P(B \mid A_i) \P(A_i)} \]
\[ \underbrace{\color{blue}{f_{\Theta|Y}(\theta \mid y)}}_{\text{Posterior}} = \frac{ \overbrace{ \color{red}{f_{Y|\Theta}(y \mid \theta)} }^{\text{Likelihood}} \cdot \overbrace{\color{green}{f_\Theta(\theta)}}^{ \text{Prior} } }{ \underbrace{\int_{-\infty}^{\infty} f_{Y|\Theta}(y \mid \theta) f_\Theta(\theta) \, \text{d}\theta}_{\text{Prior Predictive } f(y)} } \propto \color{red}{f_{Y|\Theta}(y \mid \theta)} \cdot \color{green}{f_\Theta(\theta)} \]
\[ \begin{align} A_{c}& = \pi r^2 \\ A_{s}& = (2r)^2 = 4r^2 \\ \frac{A_{c}}{A_{s}}& = \frac{\pi r^2}{4r^2} = \frac{\pi}{4} \implies \pi = \frac{4A_{c}}{A_{s}} \end{align} \]
To estimate \(\pi\), we perform the following simulation:
\[ \begin{align} X& \sim \text{Uniform}(-1, 1) \\ Y& \sim \text{Uniform}(-1, 1) \\ \pi& \approx \frac{4 \sum_{i=1}^{N} \I(x_i^2 + y_i^2 < 1)}{N} \end{align} \]
The numerator is a sum over an indicator function \(\I\), which evaluates to \(1\) if the inequality holds and \(0\) otherwise.
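A minimal R sketch of this simulation (the number of sampled points is arbitrary):

N <- 1e6                            # number of sampled points
x <- runif(N, min = -1, max = 1)    # X ~ Uniform(-1, 1)
y <- runif(N, min = -1, max = 1)    # Y ~ Uniform(-1, 1)
inside <- x^2 + y^2 < 1             # indicator I(x^2 + y^2 < 1)
4 * mean(inside)                    # Monte Carlo estimate of pi, approx 3.14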
For a more complete workflow, see Bayesian Workflow by Gelman et al. (2020)
\[ \begin{eqnarray*} \P(y = \text{red}) &=& \sum_{i=1}^{3} f(y = \text{red}, \theta_i) = \sum_{i=1}^{3} f(y = \text{red} \mid \theta_i) f(\theta_i) \\ &=& (0.25) \cdot \left(\frac{1}{3}\right) + (0.50) \cdot \left(\frac{1}{3}\right) + (0.75) \cdot \left(\frac{1}{3}\right) \end{eqnarray*} \]
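A quick R check of this prior predictive probability:

theta <- c(0.25, 0.50, 0.75)   # candidate proportions of red balls
prior <- rep(1/3, 3)           # uniform prior over the three candidates
sum(theta * prior)             # P(y = red) = 0.5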
Take red ball as success, so \(y\) successes out of \(N = 10\) trials \[ y | \theta \sim \text{Bin}(N,\theta) = \text{Bin}(10,\theta) \]
\(f(y \mid \theta) = \text{Bin} (y \mid 10,\theta) = \binom{10}{y} \theta^y (1 - \theta)^{10 - y}\) for \(y \in \{0,1,\ldots,10\}\)
Is this a valid probability distribution as a function of \(y\)?
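One can check numerically that it is, for any fixed \(\theta\) (here \(\theta = 0.5\) as an arbitrary example):

sum(dbinom(0:10, size = 10, prob = 0.5))   # sums to 1 over y = 0, 1, ..., 10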
Compare with the data model:
\[ \begin{eqnarray*} f(\theta \mid y) &=& \frac{f(y \mid \theta) f(\theta)}{f(y)} = \frac{f(y \mid \theta) f(\theta)}{\sum_{i=1}^{3} f(y, \, \theta_i)} = \frac{f(y \mid \theta) f(\theta)}{\sum_{i=1}^{3} f(y \mid \theta_i) f(\theta_i)} \\ &\propto& f(y \mid \theta) f(\theta) \propto \theta^6 (1 - \theta)^{4} \end{eqnarray*} \]
theta | prior | likelihood | likelihood × prior | posterior |
---|---|---|---|---|
0.25 | 0.33 | 0.02 | 0.01 | 0.04 |
0.50 | 0.33 | 0.21 | 0.07 | 0.56 |
0.75 | 0.33 | 0.15 | 0.05 | 0.40 |
Total | 1.00 | 0.37 | 0.12 | 1.00 |
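A short R sketch that reproduces this table, assuming the observed data are \(y = 6\) red balls in \(N = 10\) draws (consistent with the \(\theta^6 (1 - \theta)^4\) kernel above):

theta <- c(0.25, 0.50, 0.75)
prior <- rep(1/3, 3)
lik   <- dbinom(6, size = 10, prob = theta)   # f(y = 6 | theta)
lxp   <- lik * prior                          # likelihood x prior
post  <- lxp / sum(lxp)                       # normalized posterior
round(data.frame(theta, prior, lik, lxp, post), 2)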
\[ \theta^* = \textrm{arg max}_{\theta_i} \left\{ x_i \times f(\theta_i \mid y) - \text{Cost} \right\} \]
Warning
The bet on the most likely outcome, \(\theta=0.50\) (MLE), is suboptimal!
\[ \begin{eqnarray*} \P(\tilde{y} = \text{red} \mid y) &=& \sum_{i=1}^{3} f(\tilde{y} = \text{red}, \ \theta_i \mid y) = \sum_{i=1}^{3} f(\tilde{y} = \text{red} \mid \theta_i, \ y) f(\theta_i \mid y) \\ &=& \sum_{i=1}^{3} f(\tilde{y} = \text{red} \mid \theta_i) f(\theta_i \mid y) \\ \P(\tilde{y} = \text{red} \mid y) &=& 0.25 \times 0.044 +0.5 \times 0.559 + 0.75 \times 0.397 \\ &\approx& 0.589 \end{eqnarray*} \]
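The same posterior predictive calculation in R, using the (less rounded) posterior probabilities from above:

theta <- c(0.25, 0.50, 0.75)
post  <- c(0.044, 0.559, 0.397)   # f(theta | y)
sum(theta * post)                 # P(y_tilde = red | y), approx 0.589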
Combine the two cases:
\[ EV_{\text{extra}} = P(\text{red})\times7.593 + P(\text{green})\times3.600. \]
Calculation:
\[ \P(\text{red}) \times 7.593 \approx 4.465, \qquad \P(\text{green}) \times 3.600 \approx 1.480. \]
Thus:
\[ EV_{\text{extra}} \approx 4.465 + 1.480 = 5.945. \]
Net Expected Payoff (including costs):
\[ 5.945 - 3 = \$2.945. \]
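A quick R check of this calculation (the payoffs 7.593 and 3.600 and the $3 cost are taken from the slide; small differences from 2.945 are due to rounding of P(red)):

p_red <- 0.589                                     # posterior predictive P(red) from above
ev_extra <- p_red * 7.593 + (1 - p_red) * 3.600    # approx 5.95
ev_extra - 3                                       # net expected payoff, approx 2.95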
Decision: It is better to pay $2 and play immediately rather than buying an extra ball.