NYU Applied Statistics for Social Science Research
\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\P}{\mathbb{P}} \DeclareMathOperator{\V}{\mathbb{V}} \DeclareMathOperator{\L}{\mathcal{L}} \DeclareMathOperator{\I}{\text{I}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \]
The Poisson distribution arises when we are interested in counts
In practice, we rarely use Poisson due to its restrictive nature
A Poisson RV \(Y\) is parameterized by a rate of occurrence \(\lambda\): \(Y|\lambda \sim \text{Pois}(\lambda)\)
\[
f(y \mid \lambda) = \frac{e^{-\lambda} \lambda^y}{y!}\;\; \text{ for } y \in \{0,1,2,\ldots\}
\]
Notice that the Taylor series \(e^\lambda = \sum_{y=0}^{\infty} \frac{\lambda^y}{y!}\) immediately shows that \(f\) sums to one, i.e., that it is a valid PMF
Also, \(\E(Y | \lambda) = \V(Y | \lambda) = \lambda\), which is the restriction mentioned above; real count data seldom satisfy this constraint
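As a quick sanity check of the mean-variance constraint, simulated Poisson counts do satisfy it (an illustrative sketch; the rate of 2 is an arbitrary choice):
y_sim <- rpois(1e5, lambda = 2)
c(mean(y_sim), var(y_sim))   # both approximately 2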
Assume we observe \(Y_1, Y_2, ..., Y_n\) independent Poisson random variables
The joint likelihood, a function of the parameter \(\lambda\) for \(y_i \in \{0, 1, 2, \ldots\}\) and \(\lambda > 0\), is given by: \[ \begin{eqnarray} f(y \mid \lambda) & = & \prod_{i=1}^n f(y_i \mid \lambda) = f(y_1 \mid \lambda) f(y_2 \mid \lambda) \cdots f(y_n \mid \lambda) = \prod_{i=1}^{n}\frac{\lambda^{y_i}e^{-\lambda}}{y_i!} \\ & = &\frac{\lambda^{y_1}e^{-\lambda}}{y_1!} \cdot \frac{\lambda^{y_2}e^{-\lambda}}{y_2!} \cdots \frac{\lambda^{y_n}e^{-\lambda}}{y_n!} \\ & = &\frac{\left(\lambda^{y_1}\lambda^{y_2} \cdots \lambda^{y_n}\right) \left(e^{-\lambda}e^{-\lambda} \cdots e^{-\lambda}\right)}{y_1! y_2! \cdots y_n!} \\ & = &\frac{\lambda^{\sum_{i=1}^{n} y_i}e^{-n\lambda}}{\prod_{i=1}^n y_i!} \propto \lambda^{\sum_{i=1}^{n} y_i}e^{-n\lambda} \end{eqnarray} \]
We call \(\sum_{i=1}^{n} y_i\) a sufficient statistic
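A small numerical check that the log-likelihood depends on the data only through \(n\) and \(\sum_{i=1}^{n} y_i\) (the data vector and test value of \(\lambda\) below are arbitrary illustrative choices):
# log f(y | lambda) = sum(y) * log(lambda) - n * lambda - sum(lfactorial(y))
y <- c(0, 1, 2, 3)       # arbitrary counts
lambda <- 1.7            # arbitrary test value
all.equal(sum(dpois(y, lambda, log = TRUE)),
          sum(y) * log(lambda) - length(y) * lambda - sum(lfactorial(y)))  # TRUE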
The likelihood has the form \(\lambda^{a} e^{-b\lambda}\), so we expect the conjugate prior to have the same form
The Gamma PDF satisfies this condition: \(f(\lambda) \propto \lambda^{\alpha - 1} e^{-\beta\lambda}\)
Matching up the exponents, we can interpret \((\alpha - 1)\) as the total number of incidents \(\sum y_i\) out of \(\beta\) prior observations
Full Gamma density is \(\text{Gamma}(\lambda|\alpha,\beta)=\frac{\beta^{\alpha}} {\Gamma(\alpha)} \, \lambda^{\alpha - 1}e^{-\beta \, \lambda}\) for \(\lambda, \alpha, \beta \in \mathbb{R}^+\) \[ \begin{equation} \begin{split} \E(\lambda) & = \frac{\alpha}{\beta} \\ \text{Mode}(\lambda) & = \frac{\alpha - 1}{\beta} \;\; \text{ for } \alpha \ge 1 \\ \V(\lambda) & = \frac{\alpha}{\beta^2} \\ \end{split} \end{equation} \]
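A quick simulation spot-check of these moments (\(\alpha = 3\) and \(\beta = 2\) are arbitrary illustrative values):
alpha <- 3; beta <- 2
lam <- rgamma(1e6, shape = alpha, rate = beta)
c(sim_mean = mean(lam), exact_mean = alpha / beta)     # both ~1.5
c(sim_var  = var(lam),  exact_var  = alpha / beta^2)   # both ~0.75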
When \(\alpha = 1\), \(\lambda \sim \text{Dist}(\beta)\). What is \(\text{Dist}\)? Work in pairs.
Prior is \(f(\lambda) \propto \lambda^{\alpha - 1} e^{-\beta\lambda}\)
Likelihood is \(f(y | \lambda) \propto \lambda^{\sum_{i=1}^{n} y_i}e^{-n\lambda}\) \[ \begin{eqnarray} f(\lambda \mid y) & \propto & \text{prior} \cdot \text{likelihood} \\ & = & \lambda^{\alpha - 1} e^{-\beta\lambda} \cdot \lambda^{\sum_{i=1}^{n} y_i}e^{-n\lambda} \\ & = & \lambda^{\alpha + \sum_{i=1}^{n} y_i - 1} \cdot e^{-\beta\lambda - n\lambda} \\ & = & \lambda^{\color{red}{\alpha + \sum_{i=1}^{n} y_i} - 1} \cdot e^{-\color{red}{(\beta + n)} \lambda} \\ f(\lambda \mid y) & = & \text{Gamma}\left( \alpha + \sum_{i=1}^{n} y_i, \, \beta + n \right) \end{eqnarray} \]
As before, we can match the Gamma kernel without doing the integration
The posterior mean and variance follow by substituting the updated parameters \(\alpha' = \alpha + \sum_{i=1}^{n} y_i\) and \(\beta' = \beta + n\) into the Gamma mean and variance formulas:
\[ \begin{eqnarray} \E(\lambda \mid y) &=& \frac{\alpha'}{\beta'} = \frac{\alpha + \sum_{i=1}^{n} y_i}{\beta + n} \\ \V(\lambda \mid y) &=& \frac{\alpha'}{(\beta')^2} = \frac{\alpha + \sum_{i=1}^{n} y_i}{(\beta + n)^2} \end{eqnarray} \]
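The conjugate update is a one-liner in R; a minimal sketch, assuming an illustrative Gamma(2, 1) prior and a made-up vector of counts:
alpha <- 2; beta <- 1                 # assumed prior hyperparameters
y <- c(0, 2, 1, 1, 3)                 # made-up counts
alpha_post <- alpha + sum(y)          # alpha' = alpha + sum(y)
beta_post  <- beta + length(y)        # beta'  = beta + n
c(post_mean = alpha_post / beta_post,
  post_var  = alpha_post / beta_post^2)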
Year | GC | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C14 | C15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1875 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
1876 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
1877 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 0 |
1878 | 1 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1879 | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 0 | 1 | 0 | 0 | 2 | 1 | 0 |
1880 | 0 | 3 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 2 | 1 | 4 | 3 | 0 |
1881 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
1882 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 2 | 1 | 4 | 1 |
library(ggplot2)
library(gridExtra)   # for grid.arrange()

# this flattens the data frame into a vector
# (d holds the table above; the first column, Year, is dropped)
dd <- unlist(d[, -1])
p <- ggplot(data.frame(y = dd), aes(y))
p1 <- p + geom_histogram() +
  xlab("Number of deaths reported") + ylab("") +
  ggtitle("Total reported counts of deaths")
p2 <- p + geom_histogram(aes(y = after_stat(count / sum(count)))) +
  xlab("Number of deaths reported") + ylab("") +
  ggtitle("Proportion of reported counts of deaths")
grid.arrange(p1, p2, nrow = 1)
library(tibble)

# draw n_repl sample means, each computed from n_samples Exponential(1) draws,
# and overlay the CLT normal approximation (mean 1, sd 1/sqrt(n_samples))
plot_xbar <- function(n_repl, n_samples) {
  x <- seq(0.6, 1.4, len = 100)
  xbar <- replicate(n_repl,
                    mean(rexp(n_samples, rate = 1)))
  mu <- dnorm(x, mean = 1, sd = 1/sqrt(n_samples))
  p <- ggplot(tibble(xbar), aes(x = xbar))
  p + geom_histogram(aes(y = after_stat(density)),
                     bins = 30, alpha = 0.6) +
    geom_line(aes(x = x, y = mu),
              color = 'red',
              linewidth = 0.3,
              data = tibble(x, mu)) +
    ylab("") + theme(axis.text.y = element_blank())
}
p1 <- plot_xbar(1e4, 100) +
ggtitle("Sampling means from rexp(100, 1)")
p2 <- plot_xbar(1e4, 300) +
ggtitle("Sampling means from rexp(300, 1)")
grid.arrange(p1, p2, nrow = 2)
After observing data \(y\), we can compute the joint normal likelihood, assuming \(\sigma\) is known \[ \begin{eqnarray} f(y \mid \mu) & = & \prod_{i=1}^{n}\frac{1}{\sqrt{2 \pi} \ \sigma} \exp\left( - \, \frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right) \\ & \propto & \prod_{i=1}^{n} \exp\left( - \, \frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right) \\ & = & \exp \left( {-\frac{\sum_{i=1}^n(y_i-\mu)^2}{2\sigma^2}}\right) \\ &\propto& \exp\left({-\frac{(\bar{y}-\mu)^2}{2\sigma^2/n}}\right) \end{eqnarray} \]
The last line is derived by expanding the square and dropping terms that don’t depend on \(\mu\); \(\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\)
\[ \begin{eqnarray} f(y \mid \mu) & \propto & \exp\left( - \, \frac{1}{2} \left( \frac{y - \mu}{\sigma} \right)^2 \right) \\ f(\mu) & \propto & \exp\left( - \, \frac{1}{2} \left( \frac{\mu - \theta}{\tau} \right)^2 \right) \\ f(\mu \mid y) & \propto & \exp\left( - \, \frac{1}{2} \left( \frac{\mu - \theta}{\tau} \right)^2 - \frac{1}{2} \left( \frac{y - \mu}{\sigma} \right)^2 \right) \\ & = & \exp \left( -\frac{1}{2} \left( \frac{(\mu - \theta)^2}{\tau^2} + \frac{(y - \mu)^2}{\sigma^2} \right)\right) \\ & = & \exp \left( -\frac{1}{2\tau_1^2} \left( \mu - \mu_1 \right)^2 \right) \end{eqnarray} \]
For one observation: \[ \begin{eqnarray} f(\mu \mid y) &\propto& \exp \left( -\frac{1}{2\tau_1^2} \left( \mu - \mu_1 \right)^2 \right) \text{ i.e. } \mu \mid y \sim \text{Normal}(\mu_1, \tau_1) \\ \mu_1 &=& \frac{\frac{1}{\tau^2} \theta + \frac{1}{\sigma^2} y}{\frac{1}{\tau^2} + \frac{1}{\sigma^2}} \\ \frac{1}{\tau_1^2} &=& \frac{1}{\tau^2} + \frac{1}{\sigma^2} \end{eqnarray} \]
For multiple observations: \[ \mu_1 = \frac{\frac{1}{\tau^2} \theta + \frac{n}{\sigma^2} \overline{y}}{\frac{1}{\tau^2} + \frac{n}{\sigma^2}} \\ \frac{1}{\tau_1^2} = \frac{1}{\tau^2} + \frac{n}{\sigma^2} \]
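A minimal sketch of this update in R, assuming illustrative values \(\theta = 0\), \(\tau = 1\), \(\sigma = 2\) and made-up data:
theta <- 0; tau <- 1; sigma <- 2      # assumed prior mean/sd and known sigma
y <- c(1.2, 0.4, 2.1, 1.7)            # made-up observations
n <- length(y)
prec_post <- 1 / tau^2 + n / sigma^2                          # 1 / tau_1^2
mu_1  <- (theta / tau^2 + n * mean(y) / sigma^2) / prec_post  # posterior mean
tau_1 <- sqrt(1 / prec_post)                                  # posterior sd
c(mu_1 = mu_1, tau_1 = tau_1)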
Arianna Rosenbluth Dies at 93, The New York Times
lp <- function(theta, data) {
# log(theta) * sum(data$y) + log(1 - theta) * sum(1 - data$y)
lp <- 0
for (i in 1:data$N) {
lp <- lp + log(theta) * data$y[i] + log(1 - theta) * (1 - data$y[i])
}
return(lp)
}
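As a quick check (using the same data list defined in the next chunk), lp() matches the sum of the built-in Bernoulli log densities:
data <- list(N = 5, y = c(0, 1, 1, 0, 1))
all.equal(lp(0.3, data),
          sum(dbinom(data$y, size = 1, prob = 0.3, log = TRUE)))  # TRUE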
data <- list(N = 5, y = c(0, 1, 1, 0, 1))
# generate theta parameter grid
theta <- seq(0.01, 0.99, len = 100)
# compute log likelihood and prior for every value of the grid
log_lik <- lp(theta, data); log_prior <- log(dbeta(theta, 1, 1))
# compute log posterior
log_post <- log_lik + log_prior
# convert back to the original scale and normalize
post <- exp(log_post); post <- post / sum(post)
# sample theta in proportion to the posterior probability
draws <- sample(theta, size = 1e5, replace = TRUE, prob = post)
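Since the Beta(1, 1) prior is conjugate here, the exact posterior is Beta(1 + 3, 1 + 2) = Beta(4, 3), which gives a check on the grid draws:
c(grid_mean = mean(draws), exact_mean = 4 / 7)                    # ~0.571
c(grid_sd   = sd(draws),
  exact_sd  = sqrt(4 * 3 / ((4 + 3)^2 * (4 + 3 + 1))))            # ~0.175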
We already saw a special case of MC integration
Suppose we can draw samples from PDF \(f\), \((\theta^{(1)}, \theta^{(2)},..., \theta^{(N)})\)
If we want to compute an expectation of some function \(h\): \[ \E_f[h(\theta)] = \int h(\theta)f(\theta)\, d\theta \approx \frac{1}{N} \sum_{i = 1}^{N} h \left( \theta^{(i)} \right) \]
The Law of Large Numbers tells us that these approximations improve with \(N\)
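For example, an illustrative sketch with \(f\) the Exponential(1) density and \(h(\theta) = \theta^2\), whose exact expectation is \(1 + 1 = 2\):
N <- 1e5
theta_draws <- rexp(N, rate = 1)     # draws from f
mean(theta_draws^2)                  # MC estimate of E[h(theta)], close to 2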
In practice, the challenge is obtaining draws from \(f\)
That’s where MCMC comes in
library(cmdstanr)
m1 <- cmdstan_model("stan/01-bernoulli.stan") # compile the model
data <- list(N = 5, y = c(0, 1, 1, 0, 1))
f1 <- m1$sample( # for other options to sample, help(sample)
data = data, # pass data as a list, match the vars name to Stan
seed = 123, # to reproduce results, Stan does not rely on R's seed
chains = 4, # total chains, the more, the better
parallel_chains = 4, # for multi-processor CPUs
refresh = 0, # how often to print progress updates (0 = suppress)
iter_warmup = 500, # number of draws for warmup (per chain)
iter_sampling = 500 # number of draws for samples (per chain)
)
Running MCMC with 4 parallel chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.2 seconds.
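The summary table below is the kind of output produced by cmdstanr's $summary() method; the exact call is not shown in these slides, so treat this as an assumed reconstruction:
f1$summary()   # per-variable means, quantiles, R-hat, and ESS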
# A tibble: 2 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -5.27 -4.99 0.683 0.291 -6.67 -4.78 1.00 865. 1012.
2 theta 0.579 0.584 0.167 0.182 0.288 0.843 1.00 749. 933.
# A tibble: 6 × 5
# Groups: .variable [1]
.chain .iteration .draw .variable .value
<int> <int> <int> <chr> <dbl>
1 4 495 1995 lp__ -5.27
2 4 496 1996 lp__ -4.90
3 4 497 1997 lp__ -4.78
4 4 498 1998 lp__ -4.90
5 4 499 1999 lp__ -4.90
6 4 500 2000 lp__ -5.70
# A tibble: 2 × 2
.variable mean
<chr> <dbl>
1 lp__ -5.27
2 theta 0.579
# A tibble: 2 × 7
.variable .value .lower .upper .width .point .interval
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 lp__ -4.99 -6.67 -4.78 0.9 median qi
2 theta 0.584 0.288 0.843 0.9 median qi
# A tibble: 6 × 5
.chain .iteration .draw theta lp__
<int> <int> <int> <dbl> <dbl>
1 4 495 1995 0.386 -5.27
2 4 496 1996 0.661 -4.90
3 4 497 1997 0.576 -4.78
4 4 498 1998 0.477 -4.90
5 4 499 1999 0.659 -4.90
6 4 500 2000 0.322 -5.70
theta <- seq(0.01, 0.99, len = 100)
# analytic posterior, Beta(1 + 3, 1 + 2) = Beta(4, 3), for comparison
beta_dens <- dbeta(theta, 4, 3)
# draws is assumed to be a data frame of posterior draws with a theta column
# (e.g., the spread_draws output above)
p <- ggplot(draws, aes(theta))
p + geom_histogram(aes(y = after_stat(density)),
bins = 30, alpha = 0.6) +
geom_line(aes(theta, beta_dens),
linewidth = 0.5, color = 'red',
data = tibble(theta, beta_dens)) +
ylab("") + xlab(expression(theta)) +
ggtitle("Posterior draws from Stan")
library(bayesplot)
draws <- f1$draws("theta")
color_scheme_set("viridis")
p1 <- mcmc_trace(draws, pars = "theta") + ylab(expression(theta)) +
ggtitle("Good Chain")
bad_post <- readRDS("data/bad_post.rds")
bad_draws <- bad_post$draws("mu")
p2 <- mcmc_trace(bad_draws, pars = "mu") + ylab(expression(mu)) +
ggtitle("Bad Chain")
grid.arrange(p1, p2, nrow = 2)
Autocorrelation assesses the quality of a single chain
Split R-Hat estimates the extent to which the chains are consistent with one another
It does so by assessing how well the chains mix, comparing the variance within chains to the variance between chains (technically between split sequences, since each chain is split in half) \[ \begin{eqnarray} \hat{R} = \sqrt{\frac{\frac{n-1}{n}\V_W + \frac{1}{n}\V_B}{\V_W}} = \sqrt{\frac{\V_{\text{total}}}{\V_{W}}} \end{eqnarray} \]
\(\V_W\) is within chain variance and \(\V_B\) is between chain variance
We don’t like to see R-hats greater than 1.02 and really don’t like them greater than 1.05
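A minimal sketch of the split R-hat formula above, assuming draws_mat is an iterations-by-chains matrix for a single parameter (obtainable, e.g., via posterior::extract_variable_matrix(f1$draws(), "theta")):
split_rhat <- function(draws_mat) {
  n <- floor(nrow(draws_mat) / 2)
  # split each chain in half so within-chain trends also show up as between-chain variance
  split <- cbind(draws_mat[1:n, , drop = FALSE],
                 draws_mat[(n + 1):(2 * n), , drop = FALSE])
  V_W <- mean(apply(split, 2, var))          # within-(split-)chain variance
  V_B <- n * var(colMeans(split))            # between-(split-)chain variance
  V_total <- (n - 1) / n * V_W + V_B / n
  sqrt(V_total / V_W)
}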