SMaC: Statistics, Math, and Computing

Applied Statistics for Social Science Research

Eric Novik | Summer 2024 | Session 08

Session 8 Outline

  • Statistical analysis workflow
  • Simulating data for simple regression
  • Binary predictor in a regression
  • Adding continuous predictor
  • Combining binary and continuous predictors
  • Interpreting coefficients
  • Adding interactions
  • Uncertainty and predictions
  • Assumptions of the regression analysis
  • Regression modeling advice


Analysis Workflow

  • The following is a high-level, end-to-end view of the data analysis workflow, from “R for Data Science” by Wickham and Grolemund

Some Notation

  • Observed data \(y\)
  • Unobserved but observable data \(\widetilde{y}\)
  • Unobservable parameters \(\theta\)
  • Covariates \(X\)
  • Prior distribution \(f(\theta)\) (for Bayesian inference)
  • Data model or sampling distribution (as a function of y) \(f(y \mid \theta, X)\)
  • Likelihood (as a function of \(\theta\)), \(f(y \mid \theta, X)\), sometimes written as \(\mathcal{L}(\theta \mid y, X)\)
  • Maximum likelihood estimate (MLE): \(\hat{\theta} = \underset{\theta}{\text{argmax}} \, \mathcal{L}(\theta \mid y, X)\) (for Frequentist inference)
  • Posterior distribution \(f(\theta \mid y, X)\) (for Bayesian inference)
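  • These quantities are tied together by Bayes’ rule: the posterior is proportional to the likelihood times the prior (the denominator does not depend on \(\theta\))
\[ f(\theta \mid y, X) = \frac{f(y \mid \theta, X) \, f(\theta)}{f(y \mid X)} \propto \mathcal{L}(\theta \mid y, X) \, f(\theta) \]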

Bayes vs Frequentist Inference

  • Estimation is the process of figuring out the unknowns, i.e., unobserved quantities
  • In frequentist inference, the problem is framed in terms of the most likely value(s) of \(\theta\)
  • Bayesians want to characterize the whole distribution, a much more ambitious goal
  • Suppose we want to characterize the following function, which represents some distribution of the unknown parameter:
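  • As a hypothetical stand-in, take \(\theta \sim \text{Beta}(2, 5)\); we can characterize its distribution by simulation, the same way we will later summarize posterior draws
theta <- rbeta(1e4, shape1 = 2, shape2 = 5)  # 10,000 draws from a hypothetical Beta(2, 5)
mean(theta)                                  # a single point summary
quantile(theta, probs = c(0.05, 0.5, 0.95))  # median and a 90% interval
hist(theta, breaks = 50)                     # the whole distribution, not just a point estimate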

Basic Regression Setup

  • There are lots of possible ways to set up a regression; a few common setups are sketched in R code below
  • One predictor: \(y = a + bx + \epsilon\)
  • Multiple predictors: \(y = b_0 + b_1x_1 + b_2x_2 + \cdots + b_nx_n + \epsilon\), which we can write as \(y = X\beta + \epsilon\)
  • Models with interactions: \(y = b_0 + b_1x_1 + b_2x_2 + b_3x_1x_2 + \epsilon\)
  • Non-linear models like: \(\log(y) = a + b\log(x) + \epsilon\)
  • Generalized Linear Models (GLMs), which can, for example, fit binary data, categorical data, and so on
  • Nonparametric models, popular in machine learning, that can learn flexible functional forms relating \(y\) to \(X\)
  • These are not a panacea: if you have a good (or good enough) functional form, predictions will generally be more accurate and the model will generalize better
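  • In R’s formula notation, these setups might look as follows (a sketch; d is a hypothetical data frame with columns y, x, x1, x2, and a binary outcome z)
fit_one   <- lm(y ~ x, data = d)                            # one predictor
fit_many  <- lm(y ~ x1 + x2, data = d)                      # multiple predictors
fit_inter <- lm(y ~ x1 + x2 + x1:x2, data = d)              # interaction (same as y ~ x1 * x2)
fit_log   <- lm(log(y) ~ log(x), data = d)                  # non-linear in the original scale of y and x
fit_bin   <- glm(z ~ x1 + x2, family = binomial, data = d)  # GLM for a binary outcome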

Simulating a Simple Regression

  • Fitting simulated data is a good start for an analysis
library(rstanarm)

# simulate data from y = a + b*x + Normal(0, sigma) noise
n <- 15; x <- 1:n
a <- 0.5
b <- 2
sigma <- 3
y <- a + b*x + rnorm(n, mean = 0, sd = sigma) # or sigma * rnorm(n)
data <- data.frame(x, y)

# fit the regression with stan_glm's default priors
fit1 <- stan_glm(y ~ x, data = data, refresh = 0) 
print(fit1)
stan_glm
 family:       gaussian [identity]
 formula:      y ~ x
 observations: 15
 predictors:   2
------
            Median MAD_SD
(Intercept) 0.1    1.4   
x           2.1    0.2   

Auxiliary parameter(s):
      Median MAD_SD
sigma 2.6    0.5   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Simulating a Simple Regression

  • We extract the estimated coefficients using the coef() R function
library(ggplot2)        # for plotting
a_hat <- coef(fit1)[1]  # estimated intercept
b_hat <- coef(fit1)[2]  # estimated slope
p <- ggplot(aes(x, y), data = data)
p + geom_point(size = 0.2) + geom_abline(intercept = a_hat, slope = b_hat, linewidth = 0.1)

KidIQ Dataset

  • Data from a survey of adult American women and their children (a subsample from the National Longitudinal Survey of Youth).
knitr::kable(rstanarm::kidiq[1:8, ])
kid_score  mom_hs     mom_iq  mom_age
       65       1  121.11753       27
       98       1   89.36188       25
       85       1  115.44316       27
       83       1   99.44964       25
      115       1   92.74571       27
       98       0  107.90184       18
       69       1  138.89311       20
      106       1  125.14512       23

Single Binary Predictor

fit2 <- stan_glm(kid_score ~ mom_hs, data = kidiq, refresh = 0) 
print(fit2)
stan_glm
 family:       gaussian [identity]
 formula:      kid_score ~ mom_hs
 observations: 434
 predictors:   2
------
            Median MAD_SD
(Intercept) 77.6    2.0  
mom_hs      11.8    2.4  

Auxiliary parameter(s):
      Median MAD_SD
sigma 19.9    0.7  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
  • Our model is \(\text{kid_score} = 78 + 12 \cdot \text{mom_hs} + \epsilon\)
  • In this dataset, what is the average IQ of kids whose mothers did not complete high school?
  • What is the average IQ of kids whose mothers completed high school?
  • Did high school cause the IQ change?
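  • One way to check the first two questions, from the raw data and from the fitted coefficients (a quick sketch using fit2 from above):
aggregate(kid_score ~ mom_hs, data = kidiq, FUN = mean)  # group means straight from the data
coef(fit2)[1]                  # no high school: the intercept, about 78
coef(fit2)[1] + coef(fit2)[2]  # high school: intercept plus slope, about 89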

Single Binary Predictor

p <- ggplot(aes(mom_hs, kid_score), data = kidiq)
p + geom_jitter(height = 0, width = 0.1, size = 0.2) +
  geom_abline(intercept = coef(fit2)[1], slope = coef(fit2)[2], linewidth = 0.1) +
  scale_x_continuous(breaks = c(0, 1))

Single Continuous Predictor

fit3 <- stan_glm(kid_score ~ mom_iq, data = kidiq, refresh = 0) 
print(fit3)
stan_glm
 family:       gaussian [identity]
 formula:      kid_score ~ mom_iq
 observations: 434
 predictors:   2
------
            Median MAD_SD
(Intercept) 26.0    6.1  
mom_iq       0.6    0.1  

Auxiliary parameter(s):
      Median MAD_SD
sigma 18.3    0.6  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
  • Our model is \(\text{kid_score} = 26 + 0.6 \cdot \text{mom_iq}+ \epsilon\)
  • How much higher do we predict kids’ scores to be when maternal IQ is 10 points higher?
  • What is the meaning of the intercept of 26 here?
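  • A quick check of both questions from the fitted model (a sketch using fit3 from above):
coef(fit3)["mom_iq"] * 10  # a 10-point difference in maternal IQ predicts about a 6-point difference
range(kidiq$mom_iq)        # mom_iq is never near 0, so the intercept is an extrapolation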

Simple Transformations

  • We will create a new variable called mom_iq_c, which stands for centered
  • The transformation: \(\text{mom_iq_c}_i = \text{mom_iq}_i - \overline{\text{mom_iq}}\)
library(dplyr)  # for mutate()
kidiq <- kidiq |>
  mutate(mom_iq_c = mom_iq - mean(mom_iq))
fit4 <- stan_glm(kid_score ~ mom_iq_c, data = kidiq, refresh = 0) 
print(fit4)
stan_glm
 family:       gaussian [identity]
 formula:      kid_score ~ mom_iq_c
 observations: 434
 predictors:   2
------
            Median MAD_SD
(Intercept) 86.8    0.9  
mom_iq_c     0.6    0.1  

Auxiliary parameter(s):
      Median MAD_SD
sigma 18.3    0.6  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
  • Our model is \(\text{kid_score} = 87 + 0.6 \cdot \text{mom_iq_c}+ \epsilon\)
  • What is the meaning of the intercept in this model?
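  • A quick check (a sketch using fit4 from above): after centering, the intercept is the predicted score for a mother of average IQ, which lands near the sample mean of kid_score
coef(fit4)[1]          # predicted kid_score at mom_iq_c = 0, i.e., at average maternal IQ
mean(kidiq$kid_score)  # close to the intercept of about 87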

Single Continuous Predictor

p <- ggplot(aes(mom_iq, kid_score), data = kidiq)
p + geom_point(size = 0.2) +
  geom_abline(intercept = coef(fit3)[1], slope = coef(fit3)[2], linewidth = 0.1)

Combining the Predictors

fit5 <- stan_glm(kid_score ~ mom_hs + mom_iq_c, data = kidiq, refresh = 0)
print(fit5)
stan_glm
 family:       gaussian [identity]
 formula:      kid_score ~ mom_hs + mom_iq_c
 observations: 434
 predictors:   3
------
            Median MAD_SD
(Intercept) 82.2    1.9  
mom_hs       6.0    2.2  
mom_iq_c     0.6    0.1  

Auxiliary parameter(s):
      Median MAD_SD
sigma 18.2    0.6  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
  • Our model is now: \(\text{kid_score} = 82 + 6 \cdot \text{mom_hs} + 0.6 \cdot \text{mom_iq_c}+ \epsilon\)
  • Write down the interpretation of each coefficient

Combining the Predictors

fit6 <- stan_glm(kid_score ~ mom_hs + mom_iq, data = kidiq, refresh = 0)
print(fit6)
stan_glm
 family:       gaussian [identity]
 formula:      kid_score ~ mom_hs + mom_iq
 observations: 434
 predictors:   3
------
            Median MAD_SD
(Intercept) 25.8    5.9  
mom_hs       6.0    2.3  
mom_iq       0.6    0.1  

Auxiliary parameter(s):
      Median MAD_SD
sigma 18.2    0.6  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
  • Our model is: \(\text{kid_score} = 26 + 6 \cdot \text{mom_hs} + 0.6 \cdot \text{mom_iq}+ \epsilon\)
  • For moms who did not complete high school, the line is \(\text{kid_score} = 26 + 0.6 \cdot \text{mom_iq}\)
  • For moms who did: \(\text{kid_score} = 26 + 6 \cdot 1 + 0.6 \cdot \text{mom_iq} = 32 + 0.6 \cdot \text{mom_iq}\)

Visualizing the Fit With Two Predictors

p <- ggplot(aes(mom_iq, kid_score), data = kidiq)
p + geom_point(aes(color = as.factor(mom_hs)), size = 0.2) +
  labs(color = "Mom HS") +
  geom_abline(intercept = coef(fit6)[1], slope = coef(fit6)[3], linewidth = 0.1, color = 'red') +
  geom_abline(intercept = coef(fit6)[1] + coef(fit6)[2], slope = coef(fit6)[3], linewidth = 0.1, color = 'blue')

Adding Interactions

  • In the previous model, we forced the slopes of mothers who completed and did not complete high school to be the same — the only difference was the intercept
  • To allow the slopes to vary, we include an interaction term
fit7 <- stan_glm(kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq, data = kidiq, refresh = 0)
print(fit7)
stan_glm
 family:       gaussian [identity]
 formula:      kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq
 observations: 434
 predictors:   4
------
              Median MAD_SD
(Intercept)   -10.6   13.7 
mom_hs         50.2   15.1 
mom_iq          1.0    0.1 
mom_hs:mom_iq  -0.5    0.2 

Auxiliary parameter(s):
      Median MAD_SD
sigma 18.0    0.6  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
  • Our model is: \(\text{kid_score} = -10 + 50 \cdot \text{mom_hs} + 1 \cdot \text{mom_iq} - 0.5 \cdot \text{mom_hs} \cdot \text{mom_iq} + \epsilon\)
  • To figure out the slopes, we consider two cases
  • For \(\text{mom_hs} = 0\), the line is: \(\text{kid_score} = -10 + 1 \cdot \text{mom_iq}\)
  • For \(\text{mom_hs} = 1\), the line is: \(\text{kid_score} = -10 + 50 \cdot 1 + 1 \cdot \text{mom_iq} - 0.5 \cdot 1 \cdot \text{mom_iq} = 40 + 0.5\cdot \text{mom_iq}\)

Visualizing Interactions

p <- ggplot(aes(mom_iq, kid_score), data = kidiq)
p + geom_point(aes(color = as.factor(mom_hs)), size = 0.2) +
  labs(color = "Mom HS") +
  geom_abline(intercept = coef(fit7)[1], slope = coef(fit7)[3], linewidth = 0.1, color = 'red') +
  geom_abline(intercept = coef(fit7)[1] + coef(fit7)[2], slope = coef(fit7)[3] + coef(fit7)[4], linewidth = 0.1, color = 'blue')

Interpreting Coefficients in Complex Non-Linear Models

  • Models that have many parameters, many interactions, and are non-linear are difficult to interpret by looking at parameters marginally (i.e., one at a time)
  • For this reason, we typically rely on assessing how changes to model inputs affect model predictions, not the impact of each parameter
  • Predictions take all the parameter values and their interactions into account
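  • For example, using the interaction model fit7 from the previous slides, we can compare predicted average scores at two input settings rather than reading coefficients one at a time (a sketch; the IQ value of 100 is an arbitrary choice)
new_moms <- data.frame(mom_hs = c(0, 1), mom_iq = c(100, 100))  # two hypothetical mothers, same IQ
lin_pred <- posterior_linpred(fit7, newdata = new_moms)         # draws of the expected score at each setting
quantile(lin_pred[, 2] - lin_pred[, 1], probs = c(0.05, 0.5, 0.95))  # implied difference, with uncertainty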

Uncertainty and Prediction

  • We return to our simple linear regression model: \(\text{kid_score} = 26 + 0.6 \cdot \text{mom_iq}+ \epsilon\)
  • We can extract the simulations of all plausible parameter values (called a posterior distribution in Bayesian analysis)
post <- as.matrix(fit3)
dim(post)
[1] 4000    3
knitr::kable(post[1:5, ])
(Intercept)     mom_iq     sigma
   31.77474  0.5396861  18.28899
   27.21061  0.5917098  18.20332
   27.27470  0.5942835  18.29135
   26.54087  0.6114706  18.79476
   25.46339  0.6054314  17.86559

Uncertainty and Prediction

  • Using these simulations, we display 50 plausible regression lines
post <- as.data.frame(fit3)
post_sub <- post[sample(1:nrow(post), 50), ]
p <- ggplot(aes(mom_iq, kid_score), data = kidiq)
p + geom_point(size = 0.2) + 
  geom_abline(aes(intercept = `(Intercept)`, slope = mom_iq), 
              linewidth = 0.1,
              data = post_sub)

Uncertainty and Prediction

  • We can compute inferential uncertainty in all the parameters directly
  • We can also compute event probabilities
post <- as.matrix(fit3)
# median
apply(post, 2, median)
(Intercept)      mom_iq       sigma 
 26.0236767   0.6078977  18.2714868 
apply(post, 2, mad)
(Intercept)      mom_iq       sigma 
 6.10031655  0.06062729  0.61543505 
# 50% uncertainty estimate
apply(post, 2, quantile, probs = c(0.25, 0.75))
     parameters
      (Intercept)    mom_iq    sigma
  25%    21.75756 0.5681001 17.86710
  75%    30.06156 0.6504483 18.69458
# 90% uncertainty estimate
apply(post, 2, quantile, probs = c(0.05, 0.95))
     parameters
      (Intercept)    mom_iq    sigma
  5%     15.72707 0.5106528 17.32535
  95%    35.77134 0.7095889 19.33193
# What is the probability that the coefficient of mom_iq is > 0.6
mean(post[, "mom_iq"] > 0.6) |> round(2)
[1] 0.55

Types of Prediction

  • Point prediction: \(\hat{a} + \hat{b}x^{\text{new}}\) — predicting expected average \(y\) — don’t recommend
  • Linear predictor with uncertainty in \(a\) and \(b\): \(a + bx^{\text{new}}\) — predicting uncertainty around the expected average value of \(y\) — typically used to assess a treatment effect
  • Predictive distribution: \(a + bx^{\text{new}} + \epsilon\) — predictive uncertainty around a new \(y\) — used to predict for a new individual (not ATT)

Prediction in RStanArm

  • We observe a new kid whose mom’s IQ = 120
  • The point prediction is shown in red, the 90% uncertainty around the average kid score in blue, and the 90% predictive uncertainty for a new kid in yellow.
mom120 <- data.frame(mom_iq = 120)
(point_pred <- predict(fit3, newdata = mom120))
      1 
98.9816 
lin_pred <- posterior_linpred(fit3, 
                              newdata = mom120)
post_pred <- posterior_predict(fit3, 
                               newdata = mom120)
(lp_range <- quantile(lin_pred, 
                      probs = c(0.05, 0.95)))
      5%      95% 
 96.5370 101.4359 
(pp_range <- quantile(post_pred, 
                      probs = c(0.05, 0.95)))
       5%       95% 
 68.88551 128.75195 

Linear Regression Assumptions

Presented in order of importance

  • Validity of the model for the question at hand
  • Representativeness and valid scope of inference
  • Additivity and linearity
  • Independence of errors
  • Equal variance of errors
  • Normality of errors
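  • Some of these can be checked informally with residual plots (a minimal sketch for fit3 from above; residuals() and fitted() are available for stan_glm fits)
res_df <- data.frame(fitted = fitted(fit3), resid = residuals(fit3))
p <- ggplot(aes(fitted, resid), data = res_df)
p + geom_point(size = 0.2) +
  geom_hline(yintercept = 0, linewidth = 0.1)  # look for curvature (non-linearity) and funnel shapes (unequal variance)
qqnorm(res_df$resid); qqline(res_df$resid)     # rough check of normality of errors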

10+ quick tips to improve your regression modeling

This advice comes from “Regression and Other Stories” by Gelman, Hill, and Vehtari

  • Think generatively
  • Think about what you are measuring
  • Think about variation, replication
  • Forget about statistical significance
  • Graph the relevant and not the irrelevant
  • Interpret regression coefficients as comparisons
  • Understand statistical methods using fake-data simulation
  • Fit many models
  • Set up a computational workflow
  • Use transformations
  • Do causal inference in a targeted way, not as a byproduct of a large regression
  • Learn methods through live examples

Thanks for being part of my class!

  • Lectures: https://ericnovik.github.io/smac.html
  • Keep in touch:
    • eric.novik@nyu.edu
    • https://www.linkedin.com/in/enovik/
    • Twitter: @ericnovik
    • Personal blog: https://ericnovik.com/