[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 2 3 4 5 6 7
[2,] 3 4 5 6 7 8
[3,] 4 5 6 7 8 9
[4,] 5 6 7 8 9 10
[5,] 6 7 8 9 10 11
[6,] 7 8 9 10 11 12
NYU Applied Statistics for Social Science Research
\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\P}{\mathbb{P}} \DeclareMathOperator{\V}{\mathbb{V}} \DeclareMathOperator{\L}{\mathscr{L}} \DeclareMathOperator{\I}{\text{I}} \]
https://tinyurl.com/two-truths-and
⚠️ What follows is an oversimplified opinion.
Summary of the book The Theory That Would Not Die
We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at any given moment knew all of the forces that animate nature and the mutual positions of the beings that compose it, if this intellect were vast enough to submit the data to analysis, could condense into a single formula the movement of the greatest bodies of the universe and that of the lightest atom; for such an intellect nothing could be uncertain, and the future just like the past would be present before its eyes.
Marquis Pierre Simon de Laplace (1749–1827)
“Uncertainty is a function of our ignorance, not a property of the world.”
OpenAI DALL·E: Pierre Simon Laplace in the style of Wassily Kandinsky
Gelman, A. (2010). Breaking down the 2008 vote. In Atlas of the 2008 Elections.
Joensuu, H., Vehtari, A., et al. (2012). Risk of recurrence of gastrointestinal stromal tumour after surgery: An analysis of pooled population-based cohorts. The Lancet Oncology, 13(3), 265–274.
Random variable X for the number of Heads in two flips
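sample() simulates rolls of a die, and replicate() repeats the rolling process many times.
purrr::map(1:n, \(x) expression) is another way to repeat an expression n times; it returns a list.
roll_sums < 11 creates an indicator variable.
mean() averages the indicator, which gives the proportion of simulations where it is TRUE.
List of 20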
$ : int [1:2] 1 3
$ : int [1:2] 6 5
$ : int [1:2] 2 5
$ : int [1:2] 5 5
$ : int [1:2] 1 6
$ : int [1:2] 6 3
$ : int [1:2] 6 4
$ : int [1:2] 5 5
$ : int [1:2] 4 1
$ : int [1:2] 5 4
$ : int [1:2] 6 5
$ : int [1:2] 2 5
$ : int [1:2] 4 6
$ : int [1:2] 2 6
$ : int [1:2] 6 5
$ : int [1:2] 4 5
$ : int [1:2] 6 6
$ : int [1:2] 5 6
$ : int [1:2] 3 3
$ : int [1:2] 5 2
List of 20
$ : int 2
$ : int [1:2] 4 3
$ : int [1:3] 5 1 1
$ : int [1:4] 5 6 2 6
$ : int [1:5] 2 2 4 1 1
$ : int [1:6] 1 6 6 5 4 4
$ : int [1:7] 5 3 2 2 4 4 3
$ : int [1:8] 2 6 3 5 3 1 1 1
$ : int [1:9] 2 1 4 1 5 2 6 4 6
$ : int [1:10] 3 6 4 5 4 4 6 3 3 6
$ : int [1:11] 4 1 4 1 5 4 6 3 6 5 ...
$ : int [1:12] 5 2 5 4 2 6 3 4 1 4 ...
$ : int [1:13] 6 5 5 5 4 3 2 5 3 6 ...
$ : int [1:14] 3 5 3 5 1 3 6 6 2 4 ...
$ : int [1:15] 4 6 6 5 1 2 5 5 3 2 ...
$ : int [1:16] 3 2 6 4 2 4 2 2 6 6 ...
$ : int [1:17] 5 2 5 2 5 1 3 4 4 6 ...
$ : int [1:18] 3 6 1 4 2 4 5 4 2 4 ...
$ : int [1:19] 5 5 6 3 3 5 4 1 5 4 ...
$ : int [1:20] 6 2 4 2 3 3 3 1 4 6 ...
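The simulation recipe above (sample(), replicate(), an indicator, then mean()) can be sketched as follows. This is a minimal sketch, not the slides' original code: the seed, the number of simulations n_sims, and the names p_hat and p_exact are illustrative assumptions. The exact answer comes from enumerating all 36 outcomes with outer().

```r
set.seed(123)    # arbitrary seed, used only for reproducibility (assumption)
n_sims <- 1e5    # hypothetical number of simulated throws of two dice

# Each replicate rolls two fair dice and returns their sum
roll_sums <- replicate(n_sims, sum(sample(1:6, size = 2, replace = TRUE)))

# roll_sums < 11 is a logical indicator; its mean is the share of TRUE values
p_hat <- mean(roll_sums < 11)

# Exact probability: enumerate the 6 x 6 table of sums and average the indicator
p_exact <- mean(outer(1:6, 1:6, "+") < 11)

c(simulated = p_hat, exact = p_exact)
```

The simulated value should sit close to the exact one, and the gap shrinks as n_sims grows.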
Sex | Survived: No | Survived: Yes | Total |
---|---|---|---|
Male | 0.620 | 0.167 | 0.787 |
Female | 0.057 | 0.156 | 0.213 |
Total | 0.677 | 0.323 | 1.000 |
“Untergang der Titanic”, as conceived by Willy Stöwer, 1912
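The joint probabilities in the table above appear to match R's built-in Titanic dataset. That is an assumption, since the slides do not name their data source. Under that assumption, a minimal sketch of how the joint table and the conditional probabilities of survival could be computed:

```r
# ASSUMPTION: the table comes from datasets::Titanic,
# a 4-way table of counts with dimensions Class x Sex x Age x Survived
counts <- margin.table(Titanic, margin = c(2, 4))  # collapse to Sex x Survived

# Joint probabilities P(Sex, Survived), with row and column totals added
joint <- prop.table(counts)
round(addmargins(joint), 3)

# Conditional probability: P(Survived = Yes | Sex) = P(Sex, Yes) / P(Sex)
p_yes_given_sex <- joint[, "Yes"] / rowSums(joint)
round(p_yes_given_sex, 3)
```

Dividing a joint probability by the row total conditions on sex; dividing by the column total instead would condition on the outcome.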
Let \(A_1, \ldots, A_n\) be a partition of \(\Omega\), so that the \(A_i\) are pairwise disjoint, \(\P(A_i) > 0\) for each \(i\), and \(\cup_{i=1}^{n} A_i = \Omega\). \[ \P(B) = \sum_{i=1}^{n} \P(B \cap A_i) = \sum_{i=1}^{n} \P(B \mid A_i) \P(A_i) \]
Image from Blitzstein and Hwang (2019), Page 55
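As a worked instance of the law of total probability, take the joint table by sex shown earlier and partition the passengers into Male and Female. The marginal probability of the Yes column is then the sum of the two joint probabilities in that column (numbers are the rounded table entries):

\[ \P(\text{Yes}) = \P(\text{Yes} \mid \text{Male}) \P(\text{Male}) + \P(\text{Yes} \mid \text{Female}) \P(\text{Female}) = \frac{0.167}{0.787} \cdot 0.787 + \frac{0.156}{0.213} \cdot 0.213 = 0.167 + 0.156 = 0.323 \]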
\[ \P(A \mid B) = \frac{\P(B \cap A)}{\P(B)} = \frac{\P(B \mid A) \P(A)}{\sum_{i=1}^{n} \P(B \mid A_i) \P(A_i)} \]
We typically think of \(A\) as some unknown we wish to learn (e.g., the status of a disease) and \(B\) as the data we observe (e.g., the result of a diagnostic test).
We call \(\P(A)\) the prior probability of \(A\) (e.g., how prevalent the disease is in the population).
We call \(\P(A \mid B)\) the posterior probability of the unknown \(A\) given data \(B\).
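The authors calculated the sensitivity and specificity of the Abbott PanBio SARS-CoV-2 rapid antigen test to be 45.4% and 99.8%, respectively. Suppose the prevalence is 0.1%. Then \(\P(T^+ \mid D^+) = 0.454\), \(\P(D^+) = 0.001\), and the false positive rate is \(\P(T^+ \mid D^-) = 1 - 0.998 = 0.002\).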
\[ \begin{aligned} \P(D^+ \mid T^+) &= \frac{\P(T^+ \mid D^+) \P(D^+)}{\P(T^+)} \\ &= \frac{\P(T^+ \mid D^+) \P(D^+)}{\sum_{i \in \{+, -\}} \P(T^+ \mid D^i) \P(D^i)} \\ &= \frac{\P(T^+ \mid D^+) \P(D^+)}{\P(T^+ \mid D^+) \P(D^+) + \P(T^+ \mid D^-) \P(D^-)} \\ &= \frac{0.454 \cdot 0.001}{0.454 \cdot 0.001 + 0.002 \cdot 0.999} \approx 0.185 \end{aligned} \]
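The same calculation can be checked numerically. This is a minimal sketch using the sensitivity, specificity, and prevalence quoted above; the variable names sens, spec, prev, p_pos, and ppv are illustrative choices, not from the slides.

```r
sens <- 0.454   # P(T+ | D+), sensitivity quoted in the text
spec <- 0.998   # P(T- | D-), specificity quoted in the text
prev <- 0.001   # P(D+), the supposed prevalence

# Law of total probability: P(T+) = P(T+ | D+) P(D+) + P(T+ | D-) P(D-)
p_pos <- sens * prev + (1 - spec) * (1 - prev)

# Bayes' rule: P(D+ | T+)
ppv <- sens * prev / p_pos
round(ppv, 3)
```

Changing prev shows how strongly the posterior depends on the prior: the test's characteristics stay fixed while the probability of disease given a positive result moves with prevalence.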
Image from Fokko Smits, Martijn Dirksen, and Ivo Schoots: RECIST 1.1 - and more
Greek letters will be used for latent parameters, and English letters will be used for observables.
For a more complete workflow, see Bayesian Workflow by Gelman et al. (2020)
library(ggplot2)

# Plot a discrete distribution f(theta) as points with vertical stems
dot_plot <- function(x, y) {
  ggplot(data.frame(x, y), aes(x, y)) +
    geom_point(size = 0.5) +
    # stem from each point (x, y) down to the horizontal axis
    geom_segment(aes(xend = x, yend = 0), linewidth = 0.2) +
    xlab(expression(theta)) +
    ylab(expression(f(theta)))
}

# Candidate response rates and their prior probabilities (which sum to 1)
theta <- c(0.10, 0.30, 0.50, 0.70, 0.90)
prior <- c(0.05, 0.45, 0.30, 0.15, 0.05)
dot_plot(theta, prior) +
  ggtitle("Prior probability of response")
Compare with the data model:
theta | prior | lik | lik_x_prior | post |
---|---|---|---|---|
0.1 | 0.05 | 0.01 | 0.00 | 0.00 |
0.3 | 0.45 | 0.13 | 0.06 | 0.29 |
0.5 | 0.30 | 0.31 | 0.09 | 0.46 |
0.7 | 0.15 | 0.31 | 0.05 | 0.23 |
0.9 | 0.05 | 0.07 | 0.00 | 0.02 |
Total | 1.00 | 0.83 | 0.20 | 1.00 |
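The table above can be reproduced with a few lines of grid arithmetic. This is a sketch under a stated assumption: the lik column matches a binomial data model with y = 3 responders out of n = 5 patients, but those values are inferred from the numbers and are not given in the text.

```r
theta <- c(0.10, 0.30, 0.50, 0.70, 0.90)
prior <- c(0.05, 0.45, 0.30, 0.15, 0.05)

# ASSUMPTION: binomial data model with y = 3 responders out of n = 5 patients
# (inferred from the lik column, not stated in the slides)
y <- 3
n <- 5

lik <- dbinom(y, size = n, prob = theta)  # likelihood at each grid value
lik_x_prior <- lik * prior                # unnormalized posterior
post <- lik_x_prior / sum(lik_x_prior)    # normalize so the column sums to 1

round(data.frame(theta, prior, lik, lik_x_prior, post), 2)
```

The sum of lik_x_prior is the denominator of Bayes' rule, which is why dividing by it turns the product column into a proper probability distribution.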
To compute event probabilities, we integrate (or sum) the posterior over the relevant region of the parameter space \[ \P(\theta \geq 0.5 \mid y) = \int_{0.5}^{1} f(\theta \mid y) \, d\theta \]
In this case, we only have discrete quantities, so we sum:
theta | prior | lik | lik_x_prior | post |
---|---|---|---|---|
0.1 | 0.05 | 0.01 | 0.00 | 0.00 |
0.3 | 0.45 | 0.13 | 0.06 | 0.29 |
0.5 | 0.30 | 0.31 | 0.09 | 0.46 |
0.7 | 0.15 | 0.31 | 0.05 | 0.23 |
0.9 | 0.05 | 0.07 | 0.00 | 0.02 |
Total | 1.00 | 0.83 | 0.20 | 1.00 |
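With a discrete grid, the event probability is a sum of posterior masses over the grid points in the region. A minimal sketch using the rounded post column from the table above (the name p_ge_half is an illustrative choice):

```r
theta <- c(0.10, 0.30, 0.50, 0.70, 0.90)
post  <- c(0.00, 0.29, 0.46, 0.23, 0.02)  # rounded posterior column from the table

# P(theta >= 0.5 | y): add up the posterior mass where the condition holds
p_ge_half <- sum(post[theta >= 0.5])
p_ge_half
```

Because the inputs are rounded to two decimals, the result is accurate only to about that precision.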
theta | prior | lik | lik_x_prior | post |
---|---|---|---|---|
0.1 | 0.2 | 0.01 | 0.00 | 0.01 |
0.3 | 0.2 | 0.13 | 0.03 | 0.16 |
0.5 | 0.2 | 0.31 | 0.06 | 0.37 |
0.7 | 0.2 | 0.31 | 0.06 | 0.37 |
0.9 | 0.2 | 0.07 | 0.01 | 0.09 |
Total | 1.0 | 0.83 | 0.17 | 1.00 |
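The two tables differ only in the prior: the earlier one puts most of its mass on 0.3 and 0.5, while this one spreads it evenly over the grid. A rough way to see how much that matters is to compare posterior summaries side by side. The sketch below uses the rounded post columns from both tables; the helper summarise_post and the labels "informative" and "flat" are hypothetical names introduced for illustration.

```r
theta <- c(0.10, 0.30, 0.50, 0.70, 0.90)

# Posterior columns copied from the two tables (rounded to 2 decimals)
post_informative <- c(0.00, 0.29, 0.46, 0.23, 0.02)
post_flat        <- c(0.01, 0.16, 0.37, 0.37, 0.09)

# Hypothetical helper: summarise a discrete posterior on the grid theta
summarise_post <- function(post) {
  c(p_ge_half = sum(post[theta >= 0.5]),  # P(theta >= 0.5 | y)
    post_mean = sum(theta * post))        # E(theta | y)
}

comparison <- rbind(informative = summarise_post(post_informative),
                    flat        = summarise_post(post_flat))
comparison
```

With the same likelihood, the even prior leaves more posterior mass on the larger values of theta than the first prior does.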
Gelman, A. et al. (2020). Bayesian Workflow. ArXiv:2011.01808 [Stat]. http://arxiv.org/abs/2011.01808