NYU Applied Statistics for Social Science Research
\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\P}{\mathbb{P}} \DeclareMathOperator{\V}{\mathbb{V}} \DeclareMathOperator{\L}{\mathcal{L}} \DeclareMathOperator{\I}{\text{I}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \]
RStanArm
In RStanArm, we can estimate a similar model as a linear regression of log weight on log height (in Stan, we can write this model directly). The prior on the intercept applies after the predictors are centered (RStanArm centers by default): \(\alpha \sim \text{Normal}(0, 5)\)
m3 <- stan_glm(
log_w ~ log_h,
data = d,
family = gaussian,
prior = normal(3, 0.3),
prior_aux = exponential(1),
prior_intercept = normal(0, 2.5),
prior_PD = 1, # don't evaluate the likelihood
seed = 1234,
refresh = 0,
chains = 4,
iter = 600
)
d |>
add_epred_draws(m3, ndraws = 100) |>
ggplot(aes(y = weight, x = height)) +
geom_point(size = 0.5) +
geom_line(aes(y = exp(.epred), group = .draw),
alpha = 0.25, color = 'green') +
xlab("Height") + ylab("Weight") +
ggtitle("Prior Predictive Simulation")
m3 <- stan_glm(
log_w ~ log_h,
data = d,
family = gaussian,
prior = normal(3, 0.3),
prior_aux = exponential(1),
prior_intercept = normal(0, 2.5),
seed = 1234,
refresh = 0,
chains = 4,
iter = 600
)
prior_summary(m3)
Priors for model 'm3'
------
Intercept (after predictors centered)
~ normal(location = 0, scale = 2.5)
Coefficients
~ normal(location = 3, scale = 0.3)
Auxiliary (sigma)
~ exponential(rate = 1)
------
See help('prior_summary.stanreg') for more details
Model Info:
function: stan_glm
family: gaussian [identity]
formula: log_w ~ log_h
algorithm: sampling
sample: 1200 (posterior sample size)
priors: see help('prior_summary')
observations: 544
predictors: 2
Estimates:
mean sd 10% 50% 90%
(Intercept) -8.0 0.1 -8.1 -8.0 -7.8
log_h 2.3 0.0 2.3 2.3 2.4
sigma 0.1 0.0 0.1 0.1 0.1
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 3.4 0.0 3.4 3.4 3.5
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 619
log_h 0.0 1.0 621
sigma 0.0 1.0 923
mean_PPD 0.0 1.0 1436
log-posterior 0.1 1.0 529
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
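On the natural scale, the fitted log-log regression implies a power law relating weight to height; plugging in the posterior medians above (\(\alpha \approx -8.0\), \(\beta \approx 2.3\)):

\[
\log w_i = \alpha + \beta \log h_i
\quad\Longleftrightarrow\quad
w_i = e^{\alpha} h_i^{\beta} \approx e^{-8.0}\, h_i^{2.3}
\]

This is why the prediction plots exponentiate `.epred` and `.prediction` before plotting against height.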
log_h <- seq(0, 5.2, len = 500)
new_data <- tibble(log_h)
pred <- add_predicted_draws(new_data, m3)
pred |>
ggplot(aes(x = exp(log_h), y = exp(.prediction))) +
stat_lineribbon(.width = c(0.90, 0.50), alpha = 0.25) +
xlab("Height (cm)") + ylab("Weight (kg)") + ggtitle("Predictions of Weight from Height") +
geom_point(aes(y = weight, x = height), size = 0.5, alpha = 0.2, data = d)
d <- readr::read_delim("data/winequality-red.csv")
# remove duplicates
d <- d[!duplicated(d), ]
p1 <- ggplot(aes(x = quality), data = d)
p1 <- p1 + geom_histogram() +
ggtitle("Red wine quality ratings")
p2 <- ggplot(aes(quality, alcohol), data = d)
p2 <- p2 +
geom_point(position =
position_jitter(width = 0.2),
size = 0.3)
grid.arrange(p1, p2, nrow = 2)
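Before modeling, the wine columns are standardized (the tibble printed next shows the scaled values). A minimal sketch of that scaling, assuming `ds` names the standardized copy of `d` used in the later plots:

```r
library(dplyr)

# Toy stand-in for the wine data; the real `d` has 12 columns.
d <- tibble::tibble(alcohol = c(9.4, 10.2, 11.1),
                    quality = c(5, 6, 7))

# Center and scale every column to mean 0, sd 1.
ds <- d |> mutate(across(everything(), ~ as.numeric(scale(.x))))
```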
# A tibble: 6 × 12
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.524 0.932 -1.39 -0.461 -0.246
2 -0.294 1.92 -1.39 0.0566 0.200
3 -0.294 1.26 -1.19 -0.165 0.0785
4 1.66 -1.36 1.47 -0.461 -0.266
5 -0.524 0.713 -1.39 -0.535 -0.266
6 -0.236 0.385 -1.09 -0.683 -0.387
# ℹ 7 more variables: free_sulfur_dioxide <dbl>, total_sulfur_dioxide <dbl>,
# density <dbl>, pH <dbl>, sulphates <dbl>, alcohol <dbl>, quality <dbl>
Model Info:
function: stan_glm
family: gaussian [identity]
formula: quality ~ alcohol
algorithm: sampling
sample: 1000 (posterior sample size)
priors: see help('prior_summary')
observations: 1359
predictors: 2
Estimates:
mean sd 10% 50% 90%
(Intercept) 0.1 0.0 0.1 0.1 0.2
alcohol 0.4 0.0 0.4 0.4 0.4
sigma 0.7 0.0 0.7 0.7 0.7
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 0.1 0.0 0.1 0.1 0.2
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 759
alcohol 0.0 1.0 720
sigma 0.0 1.0 905
mean_PPD 0.0 1.0 862
log-posterior 0.1 1.0 373
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
With prior = normal(0, 1) in RStanArm, every \(\beta\) except the intercept will be given this prior. We can visualize the marginal posteriors with mcmc_areas.
library(bayesplot)
d_new <- tibble(alcohol = c(-2, 4))
pred <- m1 |>
posterior_predict(newdata = d_new) |>
data.frame()
colnames(pred) <- c("low_alc", "high_alc")
pred <- tidyr::pivot_longer(pred, everything(),
names_to = "alc",
values_to = "value")
p <- ggplot(aes(x = value),
data = pred)
p + geom_density(aes(fill = alc, color = alc),
alpha = 1/4) +
geom_histogram(aes(x = quality,
y = after_stat(density)),
alpha = 1/2,
data = ds) +
xlab("Quality Score (-2.5, +2.5)") + ylab("")
map_real_number <- function(x) {
if (x < -2) {
return(-2.5)
} else if (x >= -2 && x < -1) {
return(-1.5)
} else if (x >= -1 && x < 0) {
return(-0.5)
} else if (x >= 0 && x < 1) {
return(0.5)
} else if (x >= 1 && x < 2) {
return(1.5)
} else if (x >= 2) {
return(2.5)
}
}
map_real_number <- Vectorize(map_real_number)
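As an aside, the same binning can be written without branching; a sketch that maps each value to its unit-bin midpoint, clipping the end bins at ±2.5 exactly as map_real_number does:

```r
# floor() picks the unit-width bin; pmin/pmax clip the end bins so
# values below -2 map to -2.5 and values at or above 2 map to 2.5.
map_real_number2 <- function(x) pmin(pmax(floor(x), -3), 2) + 0.5

map_real_number2(c(-3.7, -0.2, 0.4, 2.9))  # -2.5 -0.5 0.5 2.5
```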
yrep_cat <- map_real_number(yrep1) |>
matrix(nrow = nrow(yrep1), ncol = ncol(yrep1))
ppc_dens_overlay(ds$quality,
yrep_cat[sample(nrow(yrep1), 50), ])
Model Info:
function: stan_glm
family: gaussian [identity]
formula: quality ~ .
algorithm: sampling
sample: 1400 (posterior sample size)
priors: see help('prior_summary')
observations: 1359
predictors: 12
Estimates:
mean sd 10% 50% 90%
(Intercept) 0.1 0.0 0.1 0.1 0.1
fixed_acidity 0.0 0.1 0.0 0.0 0.1
volatile_acidity -0.2 0.0 -0.2 -0.2 -0.2
citric_acid 0.0 0.0 -0.1 0.0 0.0
residual_sugar 0.0 0.0 0.0 0.0 0.0
chlorides -0.1 0.0 -0.1 -0.1 -0.1
free_sulfur_dioxide 0.0 0.0 0.0 0.0 0.1
total_sulfur_dioxide -0.1 0.0 -0.1 -0.1 -0.1
density 0.0 0.0 -0.1 0.0 0.0
pH -0.1 0.0 -0.1 -0.1 0.0
sulphates 0.2 0.0 0.1 0.2 0.2
alcohol 0.3 0.0 0.3 0.3 0.4
sigma 0.7 0.0 0.6 0.7 0.7
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 0.1 0.0 0.1 0.1 0.2
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 1827
fixed_acidity 0.0 1.0 560
volatile_acidity 0.0 1.0 1056
citric_acid 0.0 1.0 1057
residual_sugar 0.0 1.0 926
chlorides 0.0 1.0 1373
free_sulfur_dioxide 0.0 1.0 1025
total_sulfur_dioxide 0.0 1.0 1044
density 0.0 1.0 551
pH 0.0 1.0 722
sulphates 0.0 1.0 1244
alcohol 0.0 1.0 678
sigma 0.0 1.0 1887
mean_PPD 0.0 1.0 1464
log-posterior 0.1 1.0 636
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
set.seed(123)
n <- 100
a <- 1.5
b <- 0.5
x <- runif(n, -5, 5)
eta <- a + x * b # could be negative
lambda <- exp(eta) # always positive
y <- rpois(n, lambda)
sim <- tibble(y, x, lambda)
p <- ggplot(aes(x, y), data = sim)
p + geom_point(size = 0.5) +
geom_line(aes(y = lambda),
col = 'red',
linewidth = 0.2) +
ggtitle("Simulated Poisson Data")
Model Info:
function: stan_glm
family: poisson [log]
formula: y ~ x
algorithm: sampling
sample: 2000 (posterior sample size)
priors: see help('prior_summary')
observations: 100
predictors: 2
Estimates:
mean sd 10% 50% 90%
(Intercept) 1.5 0.1 1.4 1.5 1.6
x 0.5 0.0 0.5 0.5 0.5
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 10.7 0.5 10.1 10.7 11.3
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 451
x 0.0 1.0 504
mean_PPD 0.0 1.0 1434
log-posterior 0.0 1.0 667
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
library(latex2exp)
yrep <- posterior_predict(m3)
d <- tibble(y_mu_hat = sqrt(colMeans(yrep)),
y_sd = apply(yrep, 2, sd))
p <- ggplot(aes(y_mu_hat, y_sd), data = d)
p + geom_point(size = 0.5) +
geom_abline(slope = 1, intercept = 0,
linewidth = 0.2) +
xlab(TeX(r'($\sqrt{\widehat{E(y_i)}}$)')) +
ylab(TeX(r'($\widehat{sd(y_i)}$)'))
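The plot checks the Poisson mean-variance relationship: since the variance equals the mean, the posterior predictive standard deviation should track the square root of the predictive mean, so points should fall near the 45° line:

\[
y_i \sim \text{Poisson}(\lambda_i) \implies \V(y_i) = \E(y_i) = \lambda_i,
\qquad \text{sd}(y_i) = \sqrt{\E(y_i)}
\]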
n <- 100
a <- 1.5
b <- 0.5
x <- runif(n, -5, 5)
u <- rexp(n, 0.2)
eta <- a + x * b + log(u)
# or <- a + x * b
lambda <- exp(eta)
y <- rpois(n, lambda)
# or rpois(n, u * lambda)
sim_exposure <- tibble(y, x, lambda,
exposure = u)
p <- ggplot(aes(x, y), data = sim_exposure)
p + geom_point(size = 0.5) +
ggtitle("Simulated Poisson Data with Exposure")
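The two commented alternatives in the simulation are equivalent ways of introducing the exposure \(u_i\): adding \(\log u_i\) to the linear predictor is the same as multiplying the rate by \(u_i\):

\[
\exp(\alpha + \beta x_i + \log u_i) = u_i \exp(\alpha + \beta x_i)
\quad\Longrightarrow\quad
y_i \sim \text{Poisson}(u_i \lambda_i)
\]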
m4 <- stan_glm(y ~ x,
prior_intercept = normal(0, 1),
prior = normal(0, 1),
family = poisson(link = "log"),
data = sim_exposure, refresh = 0,
iter = 1200)
m5 <- stan_glm(y ~ x,
prior_intercept = normal(0, 1),
prior = normal(0, 1),
family = poisson(link = "log"),
offset = log(exposure),
refresh = 0,
data = sim_exposure, iter = 1200)
yrep_m4 <- posterior_predict(m4)
yrep_m5 <- posterior_predict(m5)
s <- sample(nrow(yrep_m4), 100)
p1 <- ppc_dens_overlay(log(sim_exposure$y + 1),
log(yrep_m4[s, ] + 1)) +
ggtitle("No Exposure (log scale)")
p2 <- ppc_dens_overlay(log(sim_exposure$y + 1),
log(yrep_m5[s, ] + 1)) +
ggtitle("With Exposure (log scale)")
pred4 <- add_predicted_draws(sim_exposure, m4)
pred5 <- add_predicted_draws(sim_exposure, m5,
offset = log(sim_exposure$exposure))
p3 <- pred4 |>
ggplot(aes(x = x, y = .prediction)) +
stat_lineribbon(.width = c(0.90, 0.50),
alpha = 0.25) +
geom_point(aes(x = x, y = y), size = 0.5,
alpha = 0.2)
p4 <- pred5 |>
ggplot(aes(x = x, y = .prediction)) +
stat_lineribbon(.width = c(0.90, 0.50),
alpha = 0.25) +
geom_point(aes(x = x, y = y), size = 0.5,
alpha = 0.2)
The roaches data include roach1, the pre-treatment number of roaches; treatment, a treatment indicator; senior, an indicator for buildings housing only elderly residents; and exposure2, the number of days for which the roach traps were used.
# rescale to make sure coefficients are approximately
# on the same scale
roaches <- roaches |>
mutate(roach100 = roach1 / 100) |>
as_tibble()
head(roaches)
# A tibble: 6 × 6
y roach1 treatment senior exposure2 roach100
<int> <dbl> <int> <int> <dbl> <dbl>
1 153 308 1 0 0.8 3.08
2 127 331. 1 0 0.6 3.31
3 7 1.67 1 0 1 0.0167
4 7 3 1 0 1 0.03
5 0 2 1 0 1.14 0.02
6 0 0 1 0 1 0
In rstanarm, if we put a Normal(3, 1) prior on the intercept, it is unlikely to be negative, and the implied number of roaches can be as high as exp(5) ≈ 150.
\[
\begin{eqnarray}
y_i & \sim & \text{Poisson}(u_i\lambda_i)\\
\eta_i & = & \alpha + \beta_t x_{it} + \beta_r x_{ir} + \beta_s x_{is} \\
\lambda_i & = & \exp(\eta_i) \\
\alpha & \sim & \text{Normal}(3, 1) \\
\beta & \sim & \text{Normal}(?, \, ?)
\end{eqnarray}
\]
m6 <- stan_glm(y ~ roach100 + treatment + senior,
offset = log(exposure2),
prior_intercept = normal(3, 1),
prior = normal(0, 1),
family = poisson(link = "log"),
data = roaches,
iter = 600,
refresh = 0,
prior_PD = 1,
seed = 123)
yrep_m6 <- posterior_predict(m6)
summary(colMeans(yrep_m6))
Min. 1st Qu. Median Mean 3rd Qu. Max.
9.88 42.69 47.71 296.27 57.70 52941.62
m7 <- stan_glm(y ~ roach100 + treatment + senior,
offset = log(exposure2),
prior_intercept = normal(3, 1),
prior = normal(0, 1),
family = poisson(link = "log"),
data = roaches,
iter = 600,
refresh = 0,
seed = 123)
summary(m7)
Model Info:
function: stan_glm
family: poisson [log]
formula: y ~ roach100 + treatment + senior
algorithm: sampling
sample: 1200 (posterior sample size)
priors: see help('prior_summary')
observations: 262
predictors: 4
Estimates:
mean sd 10% 50% 90%
(Intercept) 3.1 0.0 3.1 3.1 3.1
roach100 0.7 0.0 0.7 0.7 0.7
treatment -0.5 0.0 -0.5 -0.5 -0.5
senior -0.4 0.0 -0.4 -0.4 -0.3
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 25.6 0.4 25.1 25.6 26.2
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 899
roach100 0.0 1.0 1074
treatment 0.0 1.0 950
senior 0.0 1.0 1045
mean_PPD 0.0 1.0 1219
log-posterior 0.1 1.0 574
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
yrep_m7 <- posterior_predict(m7)
s <- sample(nrow(yrep_m7), 100)
# on the log scale,
# so we can better see the data
p1 <- ppc_dens_overlay(log(roaches$y + 1),
log(yrep_m7[s, ] + 1))
prop_zero <- function(y) mean(y == 0)
p2 <- pp_check(m7, plotfun = "stat",
stat = "prop_zero",
binwidth = .005)
grid.arrange(p1, p2, nrow = 2)
m8 <- stan_glm(y ~ roach100 + treatment + senior,
offset = log(exposure2),
prior_intercept = normal(3, 1),
prior = normal(0, 1),
prior_aux = exponential(1),
family = neg_binomial_2,
data = roaches,
iter = 600,
refresh = 0,
seed = 123)
summary(m8)
Model Info:
function: stan_glm
family: neg_binomial_2 [log]
formula: y ~ roach100 + treatment + senior
algorithm: sampling
sample: 1200 (posterior sample size)
priors: see help('prior_summary')
observations: 262
predictors: 4
Estimates:
mean sd 10% 50% 90%
(Intercept) 2.8 0.2 2.6 2.8 3.1
roach100 1.2 0.2 1.0 1.2 1.6
treatment -0.7 0.2 -1.0 -0.7 -0.4
senior -0.3 0.3 -0.7 -0.3 0.0
reciprocal_dispersion 0.3 0.0 0.2 0.3 0.3
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 68.8 82.0 24.3 42.5 138.8
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 1742
roach100 0.0 1.0 1708
treatment 0.0 1.0 1204
senior 0.0 1.0 1573
reciprocal_dispersion 0.0 1.0 1604
mean_PPD 2.6 1.0 1017
log-posterior 0.1 1.0 515
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
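Unlike the Poisson, the negative binomial adds a dispersion parameter \(\phi\) (reported as reciprocal_dispersion above), which lets the variance exceed the mean and accommodates the overdispersed roach counts:

\[
\E(y_i) = \mu_i, \qquad \V(y_i) = \mu_i + \frac{\mu_i^2}{\phi}
\]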