SMaC: Statistics, Math, and Computing

Applied Statistics for Social Science Research

Eric Novik | Summer 2023 | Session 07

Session 7 Outline

  • Statistical analysis workflow
  • Introduction to the rstanarm package
  • Setting up a linear regression
  • Model evaluation
  • Model expansion


Analysis Workflow

  • A high-level, end-to-end view of the analysis workflow comes from “R for Data Science” by Wickham and Grolemund

Some Notation

  • Observed data \(y\)
  • Unobserved but observable data \(\widetilde{y}\)
  • Unobservable parameters \(\theta\)
  • Covariates \(X\)
  • Prior distribution \(f(\theta)\)
  • Likelihood (as a function of \(\theta\)), \(f(y | \theta, X)\)
  • Posterior distribution \(f(\theta | y, X)\) (for Bayesian inference only)

Bayes vs Frequentist Inference

  • Estimation is the process of figuring out the unknowns, i.e., unobserved quantities
  • In frequentist inference, the problem is framed in terms of the most likely value(s) of \(\theta\)
  • Bayesians want to characterize the whole distribution, a much more ambitious goal
  • Suppose we want to characterize a function that represents the distribution of an unknown parameter
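To make the contrast concrete, here is a hypothetical coin-flip example (not from the slides): the frequentist answer is a single most likely value of \(\theta\), while the Bayesian answer is the entire posterior distribution, approximated below on a grid under a flat prior.

```r
# Hypothetical example: 7 heads in 10 flips
y <- 7; n <- 10

# Frequentist: the single most likely value of theta (the MLE)
theta_mle <- y / n

# Bayesian: approximate the whole posterior on a grid,
# using a flat prior f(theta) = 1 on (0, 1)
theta <- seq(0.001, 0.999, length.out = 999)
post  <- dbinom(y, n, theta)     # likelihood times flat prior
post  <- post / sum(post)        # normalize into a distribution

# The posterior supports full summaries, not just a point estimate
post_mean <- sum(theta * post)   # close to (y + 1) / (n + 2)
```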

Wine Dataset

d <- read.delim("data/winequality-red.csv", sep = ";")
dim(d)
[1] 1599   12
d <- d[!duplicated(d), ] # remove the duplicates
dim(d)
[1] 1359   12
knitr::kable(head(d))
|   | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 2 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 3 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 4 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 6 | 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7 | 7.9 | 0.60 | 0.06 | 1.6 | 0.069 | 15 | 59 | 0.9964 | 3.30 | 0.46 | 9.4 | 5 |

Scaling the Data

library(tidyverse) # provides the %>% pipe, as_tibble(), and ggplot2
ds <- scale(d) # subtract the column mean and divide by the column sd
knitr::kable(head(ds) %>% round(2))
|   | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.52 | 0.93 | -1.39 | -0.46 | -0.25 | -0.47 | -0.38 | 0.58 | 1.29 | -0.58 | -0.95 | -0.76 |
| 2 | -0.29 | 1.92 | -1.39 | 0.06 | 0.20 | 0.87 | 0.60 | 0.05 | -0.71 | 0.12 | -0.58 | -0.76 |
| 3 | -0.29 | 1.26 | -1.19 | -0.17 | 0.08 | -0.09 | 0.21 | 0.16 | -0.32 | -0.05 | -0.58 | -0.76 |
| 4 | 1.66 | -1.36 | 1.47 | -0.46 | -0.27 | 0.11 | 0.39 | 0.69 | -0.97 | -0.46 | -0.58 | 0.46 |
| 6 | -0.52 | 0.71 | -1.39 | -0.53 | -0.27 | -0.28 | -0.20 | 0.58 | 1.29 | -0.58 | -0.95 | -0.76 |
| 7 | -0.24 | 0.39 | -1.09 | -0.68 | -0.39 | -0.09 | 0.36 | -0.17 | -0.06 | -1.16 | -0.95 | -0.76 |
check <- apply(ds, 2, function(x) c(mean = mean(x), sd = sd(x))) %>% round(2)
knitr::kable(check)
|      | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|------|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sd   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
class(ds); ds <- as_tibble(ds)
[1] "matrix" "array" 
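For intuition, scale() just centers each column on its mean and divides by its standard deviation; a minimal base-R check on a made-up vector:

```r
# What scale() does, spelled out for a single column (values made up)
x <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.9)
z <- (x - mean(x)) / sd(x)          # center, then standardize

# scale() applies the same transformation column by column
z2 <- as.numeric(scale(x))
all.equal(z, z2)                    # TRUE
```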

Quality Rating of Red Wine

  • Unscaled Quality
qplot(d$quality, geom = "histogram") + xlab("Quality Rating of Red Wine")

  • Scaled Quality
qplot(ds$quality, geom = "histogram") + xlab("Quality Rating of Red Wine")

Alcohol

p <- ggplot(ds, aes(alcohol, quality))
p + geom_jitter(width = 0.1, height = 0.2, alpha = 1/5) + xlab("Alcohol") + ylab("Quality (Jittered)")

Volatile Acidity

p <- ggplot(ds, aes(volatile.acidity, quality))
p + geom_jitter(width = 0.1, height = 0.2, alpha = 1/5) + xlab("Volatile Acidity") + ylab("Quality (Jittered)") 

Correlations

library(corrplot)
M <- cor(ds)
corrplot(M, method = 'ellipse', order = 'AOE', type = 'upper')

Linear Regression

  • We will fit a linear regression to the scaled wine dataset
  • The priors come from (sensible) defaults in rstanarm
  • Linear regression can be specified in the following way, where \(X\) is the design matrix with one or more scaled predictors and \(y\) is the scaled quality score

\[ \begin{aligned} y &\sim \mathrm{Normal}(\alpha + X \beta, \ \sigma) \\ \alpha &\sim \mathrm{Normal}(0, \ 2.5) \\ \beta &\sim \mathrm{Normal}(0, \ 2.5) \\ \sigma &\sim \mathrm{Exp}(1) \end{aligned} \]
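One way to understand this specification is to simulate fake data from it. The sketch below draws the parameters from the priors and then the outcome from the likelihood (the sample size and seed are arbitrary choices, not from the slides):

```r
set.seed(123)
n <- 1359                      # match the wine sample size
x <- rnorm(n)                  # one scaled predictor, e.g. alcohol

# Draw parameters from the priors ...
alpha <- rnorm(1, mean = 0, sd = 2.5)
beta  <- rnorm(1, mean = 0, sd = 2.5)
sigma <- rexp(1, rate = 1)

# ... then draw the outcome from the likelihood
y <- rnorm(n, mean = alpha + beta * x, sd = sigma)
```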

Estimation using R’s lm()

  • We will start by comparing alcohol content to quality ratings
  • In R’s lm(), no priors are specified; the least-squares estimates coincide with Bayesian estimates under uniform (flat) priors
fit1_freq <- lm(quality ~ alcohol, data = ds)
arm::display(fit1_freq)
lm(formula = quality ~ alcohol, data = ds)
            coef.est coef.se
(Intercept) 0.00     0.02   
alcohol     0.48     0.02   
---
n = 1359, k = 2
residual sd = 0.88, R-Squared = 0.23
# R's summary() output is more verbose than arm::display()
summary(fit1_freq)

Call:
lm(formula = quality ~ alcohol, data = ds)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4372 -0.4761 -0.2097  0.6494  3.1666 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7.310e-15  2.380e-02    0.00        1    
alcohol      4.803e-01  2.381e-02   20.17   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8774 on 1357 degrees of freedom
Multiple R-squared:  0.2307,    Adjusted R-squared:  0.2302 
F-statistic:   407 on 1 and 1357 DF,  p-value: < 2.2e-16

Running the Regression in rstanarm

library(rstanarm)
options(mc.cores = parallel::detectCores())
fit1 <- stan_glm(quality ~ alcohol, data = ds, refresh = 0)
summary(fit1)

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      quality ~ alcohol
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 1359
 predictors:   2

Estimates:
              mean   sd   10%   50%   90%
(Intercept) 0.0    0.0  0.0   0.0   0.0  
alcohol     0.5    0.0  0.4   0.5   0.5  
sigma       0.9    0.0  0.9   0.9   0.9  

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 0.0    0.0  0.0   0.0   0.0  

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.0  1.0  3930 
alcohol       0.0  1.0  4059 
sigma         0.0  1.0  3914 
mean_PPD      0.0  1.0  4029 
log-posterior 0.0  1.0  1689 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Plotting

plot(fit1, plotfun = "areas", prob = 0.9)

posterior_vs_prior(fit1)

Evaluating the Model

  • How good is this model?
library(patchwork) # for combining ggplot objects with +
p1 <- pp_check(fit1, plotfun = "ppc_dens_overlay"); p2 <- pp_check(fit1, plotfun = "ppc_ecdf_overlay")
p3 <- pp_check(fit1, plotfun = "ppc_stat", stat = "mean"); p4 <- pp_check(fit1, plotfun = "ppc_stat", stat = "sd")
p1 + p2 + p3 + p4

Evaluating the Model

  • How good is this model?
library(bayesplot) # for ppc_intervals()
yrep1 <- posterior_predict(fit1)
dim(yrep1)
[1] 4000 1359
s <- sample(1:nrow(ds), size = 50) # select 50 random records
p1 <- ppc_intervals(ds$quality[s], yrep1[, s]); p1

We Can Add Other Predictors

fit2 <- stan_glm(quality ~ ., data = ds, refresh = 0)
plot(fit2, plotfun = "areas", prob = 0.9, pars = "alcohol")

Have We Improved the Model?

yrep2 <- posterior_predict(fit2)
p1 <- p1 + ggtitle("Model 1 predictions")
p2 <- ppc_intervals(ds$quality[s], yrep2[, s]) + ggtitle("Model 2 predictions")
p1 + p2 + plot_layout(nrow = 2, byrow = FALSE)

Have We Improved the Model?

# Mean Square Error for Model 1
mean((colMeans(yrep1) - ds$quality)^2) %>% round(2)
[1] 0.77
# Mean Square Error for Model 2
mean((colMeans(yrep2) - ds$quality)^2) %>% round(2)
[1] 0.64
width <- function(yrep, q1, q2) {
  q <- apply(yrep, 2, function (x) quantile(x, probs = c(q1, q2)))
  width <- apply(q, 2, diff)
  return(mean(width))
}
width(yrep1, 0.25, 0.75) %>% round(2)
[1] 1.18
width(yrep2, 0.25, 0.75) %>% round(2)
[1] 1.09
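The width() calculation can be sanity-checked on synthetic draws: for standard normal columns, the 50% interval width should be close to qnorm(0.75) - qnorm(0.25), about 1.35. (The function is restated so this snippet runs on its own.)

```r
width <- function(yrep, q1, q2) {
  q <- apply(yrep, 2, function(x) quantile(x, probs = c(q1, q2)))
  mean(apply(q, 2, diff))
}

set.seed(1)
yrep_fake <- matrix(rnorm(4000 * 10), nrow = 4000)  # 10 fake "observations"
w <- width(yrep_fake, 0.25, 0.75)
round(w, 2)   # roughly 1.35
```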

Comparing Out of Sample Performance

loo1 <- loo(fit1, cores = 4)
loo2 <- loo(fit2, cores = 4)
loo_compare(loo1, loo2)
     elpd_diff se_diff
fit2    0.0       0.0 
fit1 -117.6      17.8 

Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432. doi:10.1007/s11222-016-9696-4.

Can We Still Improve the Model?

  • The quality variable is measured on an ordinal scale, so a more appropriate likelihood is also ordinal
  • For more information, see ordered logistic distribution in the Stan manual
  • We are going to use the following parameterization available in Stan:

\[ \text{OrderedLogistic}(k~|~\eta,c) = \left\{ \begin{array}{ll} 1 - \text{logit}^{-1}(\eta - c_1) & \text{if } k = 1, \\[4pt] \text{logit}^{-1}(\eta - c_{k-1}) - \text{logit}^{-1}(\eta - c_{k}) & \text{if } 1 < k < K, \text{and} \\[4pt] \text{logit}^{-1}(\eta - c_{K-1}) - 0 & \text{if } k = K \end{array} \right. \]

  • A similar model can be fit with frequentist methods using the MASS::polr() function
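A direct base-R translation of this pmf (my own sketch, not code from the slides) makes the cutpoint logic concrete; the probabilities across k = 1, ..., K must sum to 1:

```r
inv_logit <- function(x) 1 / (1 + exp(-x))

# P(Y = k | eta, cutpoints), following the parameterization above
ordered_logistic_pmf <- function(k, eta, cuts) {
  K <- length(cuts) + 1
  if (k == 1) {
    1 - inv_logit(eta - cuts[1])
  } else if (k < K) {
    inv_logit(eta - cuts[k - 1]) - inv_logit(eta - cuts[k])
  } else {
    inv_logit(eta - cuts[K - 1])
  }
}

cuts <- c(-2, -1, 0, 1, 2)  # K - 1 = 5 cutpoints for K = 6 quality levels
p <- sapply(1:6, ordered_logistic_pmf, eta = 0.5, cuts = cuts)
sum(p)                      # 1: a valid probability mass function
```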

Fitting Ordered Logistic

ds$quality <- d$quality - 2          # quality vector is now 1:6
ds$quality <- as.factor(ds$quality)  # it has to be treated as factor (category)
fit3 <- stan_polr(quality ~ ., prior = R2(0.25), # the prior is on explained variance
                  prior_counts = dirichlet(1), 
                  data = ds, iter = 1000, refresh = 0)
yrep3 <- posterior_predict(fit3)
yrep3[1:5, 1:5]
     1   2   3   4   5  
[1,] "1" "3" "4" "4" "4"
[2,] "4" "3" "3" "3" "3"
[3,] "3" "3" "3" "4" "3"
[4,] "3" "3" "3" "3" "3"
[5,] "4" "4" "4" "3" "3"
yrep3 <- apply(yrep3, 2, as.numeric)
yrep3[1:5, 1:5]
     1 2 3 4 5
[1,] 1 3 4 4 4
[2,] 4 3 3 3 3
[3,] 3 3 3 4 3
[4,] 3 3 3 3 3
[5,] 4 4 4 3 3

Evaluating Model Fit

pp_check(fit2, plotfun = "ppc_dens_overlay")

pp_check(fit3, plotfun = "ppc_dens_overlay")

Evaluating Model Fit

p3 <- ppc_intervals(as.numeric(ds$quality[s]), yrep3[, s]) + ggtitle("Model 3 predictions")
p2 + p3 + plot_layout(nrow = 2, byrow = FALSE)

cat("Mean Square Error for Model 2:", mean((colMeans(yrep2) - scale(d$quality))^2) %>% round(2))
Mean Square Error for Model 2: 0.64
cat("Mean Square Error for Model 3:", mean((colMeans(yrep3) - as.numeric(ds$quality))^2) %>% round(2))
Mean Square Error for Model 3: 0.43
cat("Width of the 50% interval for Model 2:", width(yrep2, 0.25, 0.75) %>% round(2))
Width of the 50% interval for Model 2: 1.09
cat("Width of the 50% interval for Model 3:", width(yrep3, 0.25, 0.75) %>% round(2))
Width of the 50% interval for Model 3: 0.73
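One caveat when reading the comparison above: Model 2's MSE is computed on the standardized quality scale, while Model 3's is on the raw 1 to 6 scale, so the two numbers are not directly comparable; dividing a raw-scale MSE by the variance of the outcome converts it to the standardized scale. A synthetic base-R check of that identity:

```r
set.seed(7)
y_raw    <- sample(3:8, 500, replace = TRUE)   # stand-in for raw quality scores
pred_raw <- y_raw + rnorm(500, 0, 0.5)         # noisy predictions, raw scale

# Standardize the outcome and predictions with the same mean and sd
y_std    <- as.numeric(scale(y_raw))
pred_std <- (pred_raw - mean(y_raw)) / sd(y_raw)

mse_raw <- mean((pred_raw - y_raw)^2)
mse_std <- mean((pred_std - y_std)^2)
all.equal(mse_std, mse_raw / var(y_raw))       # TRUE: MSE scales with variance
```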

10+ quick tips to improve your regression modeling

This advice comes from “Regression and Other Stories” by Gelman, Hill, and Vehtari

  • Think generatively (+)
  • Think about what you are measuring (+)
  • Think about variation, replication
  • Forget about statistical significance
  • Graph the relevant and not the irrelevant
  • Interpret regression coefficients as comparisons
  • Understand statistical methods using fake-data simulation
  • Fit many models
  • Set up a computational workflow
  • Use transformations
  • Do causal inference in a targeted way, not as a byproduct of a large regression
  • Learn methods through live examples

So Long (and Thanks for All the Fish)

  • Lectures: https://ericnovik.github.io/smac.html
  • Keep in touch:
    • eric.novik@nyu.edu
    • Twitter: @ericnovik
    • https://www.linkedin.com/in/enovik/
    • Personal blog: https://ericnovik.com/