Applied Statistics for Social Science Research
\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\P}{\mathbb{P}} \DeclareMathOperator{\V}{\mathbb{V}} \DeclareMathOperator{\L}{\mathscr{L}} \DeclareMathOperator{\I}{\text{I}} \]
It would be inconvenient to enumerate all possible events to describe a stochastic system
A more general approach is to introduce a function that maps the sample space \(S\) to the real line
For each possible outcome \(s\), the random variable \(X\) assigns a real number \(X(s)\)
This mapping is deterministic. The randomness comes from the experiment, not from the random variable (RV)
While it makes sense to talk about \(\P(A)\), where \(A\) is an event, it does not make sense to talk about \(\P(X)\), but you can say \(\P(X(s) = x)\), which we usually write as \(\P(X = x)\)
Let \(X\) be the number of Heads in two coin flips. You flip the coin twice, and you get \(HH\). In this case, \(s = {HH}\), \(X(s) = 2\), while \(S = \{TT, TH, HT, HH\}\)
The cumulative distribution function (CDF) of \(X\) is
\[ F_X(x) = \P(X \leq x) \]
For a discrete RV, the probability mass function (PMF) is
\[ f_X(x) = \P(X = x) \]
For example, if \(X\) takes non-negative integer values,
\[ F_X(4) = \P(X \leq 4) = \sum_{i = 0}^{4}\P(X = i) \]
In R, PMFs and PDFs start with the letter d. For example, dbinom() and dnorm() refer to the binomial PMF and the normal PDF. CDFs start with p, so pbinom() and pnorm(). Inverse CDFs, or quantile functions, start with q, so qbinom() and so on. Random number generators start with r, so rbinom()
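For example, here are the four prefixes in action for the standard Normal distribution (mean 0 and sd 1 are the defaults):
dnorm(0)     # PDF evaluated at 0, about 0.399
pnorm(0)     # CDF: P(X <= 0) = 0.5
qnorm(0.975) # quantile function: about 1.96
rnorm(3)     # three random draws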
A binomial RV, which we will define later, represents the number of successes in \(N\) trials. In R, the PMF is dbinom() and the CDF is pbinom(). Here is the full function signature: dbinom(x, size, prob, log = FALSE). The argument x is the number of successes, size is the number of trials \(N\), prob is the probability of success in each trial \(\theta\), and log is a flag asking whether we want the results on the log scale.
A Bernoulli RV is one coin flip with a set probability of success (say, Heads)
If \(X \sim \text{Bernoulli}(\theta)\), the PMF can be written directly as \(\P(X = x) = \theta^x (1 - \theta)^{1-x}, \, x \in \{0, 1\}\)
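Since a Bernoulli RV is a Binomial with a single trial, its PMF can be evaluated with dbinom() and size = 1 (the value \(\theta = 0.3\) here is an arbitrary choice):
# Bernoulli(0.3) PMF evaluated at x = 0 and x = 1
dbinom(x = c(0, 1), size = 1, prob = 0.3)
[1] 0.7 0.3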
Binomial can be thought of as the sum of \(N\) independent Bernoulli trials. We can also write:
\[ \text{Bernoulli}(x~|~\theta) = \left\{ \begin{array}{ll} \theta & \text{if } x = 1, \text{ and} \\ 1 - \theta & \text{if } x = 0 \end{array} \right. \]
\[ \text{Binomial}(x~|~N,\theta) = \binom{N}{x} \theta^x (1 - \theta)^{N - x} \]
\(\text{Binomial}(x~|~N=4,\theta = 1/2)\)
library(ggplot2)
library(patchwork)
library(MASS) # for fractions()
N <- 4 # number of trials
# compute and plot the PMF
pmf <- dbinom(x = 0:N, size = N, prob = 1/2)
d <- data.frame(x = 0:N, pmf = pmf)
p1 <- ggplot(d, aes(x, pmf))
p1 <- p1 + geom_col(width = .2) +
geom_text(aes(label = fractions(pmf)), nudge_y = 0.02) +
ylab("P(X = x)") + xlab("x = Number of Heads") +
ggtitle("X ~ Binomial(4, 1/2)",
subtitle = expression(PMF: p[X](x) == P(X == x)))
# compute and plot the CDF
x <- seq(-0.5, 4.5, length = 500)
cdf <- pbinom(q = x, size = N, prob = 1/2)
d <- data.frame(x = x, cdf = cdf)
dd <- data.frame(x = seq(-0.5, 4.5, by = 1), cdf = unique(cdf), x_empty = 0:5)
p2 <- ggplot(d, aes(x, cdf))
p2 <- p2 + geom_point(size = 0.2) +
geom_text(aes(x, cdf, label = fractions(cdf)), data = dd, nudge_y = 0.05) +
geom_point(aes(x_empty, cdf), data = dd[-6, ], size = 2, color = 'white') +
geom_point(aes(x_empty, cdf), data = dd[-6, ], size = 2, shape = 1) +
ggtitle("X ~ Binomial(4, 1/2)",
subtitle = expression(CDF: F[X](x) == P(X <= x))) +
ylab(expression(P(X <= x))) + xlab("x = Number of Heads")
p1 + p2
# What is the probability of getting 2 Heads out of 5 fair trials?
N <- 5; x <- 2
dbinom(x = x, size = N, prob = 0.5) |> fractions()
[1] 5/16
# What is the binomial PMF: P(X = x), for N = 5, p = 0.5?
N <- 5; x <- -2:7 # notice we range x over any integers
dbinom(x = x, size = N, prob = 0.5) |> fractions()
[1] 0 0 1/32 5/32 5/16 5/16 5/32 1/32 0 0
# check that the PMF sums to 1
dbinom(x = x, size = N, prob = 0.5) |> sum()
[1] 1
# What is P(X <= 3)?
pbinom(q = 3, size = N, prob = 0.5) |> fractions()
[1] 13/16
# What is the binomial CDF: P(X <= x)?
pbinom(q = x, size = N, prob = 0.5) |> fractions()
[1] 0 0 1/32 3/16 1/2 13/16 31/32 1 1 1
# get from the PMF to CDF; cumsum() is the cumulative sum function
dbinom(x = x, size = N, prob = 0.5) |> cumsum() |> fractions()
[1] 0 0 1/32 3/16 1/2 13/16 31/32 1 1 1
The Geometric PMF, \(\P(X = x) = \theta (1 - \theta)^x\) for \(x = 0, 1, 2, \dots\) (the number of failures before the first success), sums to 1:
\[ \begin{align} \sum_{x = 0}^{\infty} \theta (1 - \theta)^x &= \theta \sum_{x = 0}^{\infty} (1 - \theta)^x \\ \text{Let } u &= 1 - \theta \\ \theta \sum_{x = 0}^{\infty} u^x &= \theta \, \frac{1}{1-u} = \theta \, \frac{1}{1-(1 - \theta)} = \frac{\theta}{\theta} = 1 \end{align} \]
Let’s consider the infinite geometric series:
\[ S = \sum_{x=0}^{\infty} u^x = u^0 + u^1 + u^2 + u^3 + \dots \] This can be written explicitly as: \[ S = 1 + u + u^2 + u^3 + \dots \]
Next, multiply the entire series by \(u\):
\[ uS = u \cdot (1 + u + u^2 + u^3 + \dots) \]
This results in: \[ uS = u + u^2 + u^3 + u^4 + \dots \]
Now, subtract the equation for \(uS\) from the equation for \(S\):
\[ S - uS = (1 + u + u^2 + u^3 + \dots) - (u + u^2 + u^3 + u^4 + \dots) \]
Notice that all terms on the right-hand side except the first term (which is 1) cancel out:
\[ S - uS = 1 \]
Factor the left-hand side:
\[ S(1 - u) = 1 \]
Finally, solve for \(S\):
\[ S = \frac{1}{1 - u} \]
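We can sanity-check this formula numerically in R (u = 0.8 is an arbitrary choice well inside \(|u| < 1\), where the series converges):
u <- 0.8
sum(u^(0:1000)) # partial sum of the first 1001 terms
[1] 5
1 / (1 - u)
[1] 5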
The expected value of \(X \sim \text{Geom}(\theta)\) is
\[ \E(X) = \sum_{x=0}^{\infty} x \theta (1 - \theta)^x = \frac{1-\theta}{\theta} \]
In particular, for \(X \sim \text{Geom}(1/5)\), what is the expected number of failures before the first success?
The answer is \(4 = \frac{(1 - \frac{1}{5})}{\frac{1}{5}}\)
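A numerical check along these lines, summing only the first 100 terms (dgeom() parameterizes the Geometric as the number of failures before the first success):
sum((0:100) * dgeom(0:100, prob = 1/5))
[1] 4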
Question: the Geometric RV is unbounded, yet we summed only a finite number of terms (100) and got the right answer. What’s going on?
The Geometric PMF decays quickly, so the higher-order terms are negligible for this choice of \(\theta\)
Without doing any calculations at all, what is \(\E(X)\), where \(X \sim \text{Geom}(1/8)\)?
If \(X \sim \text{Bernoulli}(\theta)\), \(\E(X) = \sum_{x=0}^{1} x \theta^x (1 - \theta)^{1-x} = 0 + 1 \cdot \theta \cdot 1 = \theta\)
For the Binomial, it is easier to think of it as a sum of iid Bernoulli RVs. Each Bernoulli RV has an expectation of \(\theta\), and there are \(n\) of them, so the expectation is \(n \theta\).
This argument relies on the linearity of expectations: \(\E(X_1 + X_2 + \dots + X_n) = \E(X_1) + \E(X_2) + \dots + \E(X_n) = \theta + \dots + \theta = n\theta\)
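A quick numerical check of \(\E(X) = n\theta\), with an arbitrary choice of \(n = 10\) and \(\theta = 0.3\):
n <- 10; theta <- 0.3
sum((0:n) * dbinom(0:n, size = n, prob = theta)) # expectation computed from the PMF
[1] 3
n * theta
[1] 3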
A famous example where the expectation diverges is the St. Petersburg game, in which the payoff is \(2^n\) with probability \(1/2^n\):
\[ \E(X) = \sum_{n=1}^{\infty} \frac{1}{2^n} \cdot 2^n = \sum_{n=1}^{\infty} 1 = \infty \]
If we instead value the payoff on a log scale, \(U = \log_2(X)\), the expectation is finite:
\[ \E(U) = \sum_{n=1}^{\infty} \frac{1}{2^n} \log_2(2^n) = \sum_{n=1}^{\infty} \frac{n}{2^n} = 2 \neq \infty \]
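Numerically, the series for \(\E(U)\) converges quickly:
n <- 1:100
sum(n / 2^n)
[1] 2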
\[ \I_A(s) = \I(s \in A) = \left\{\begin{matrix} 1 & \text{if } s \in A \\ 0 & \text{if } s \notin A \end{matrix}\right. \]
Recall from Lecture 1:
\[ \begin{align} X& \sim \text{Uniform}(-1, 1) \\ Y& \sim \text{Uniform}(-1, 1) \\ \pi& \approx \frac{4 \sum_{i=1}^{N} \text{I}(x_i^2 + y_i^2 < 1)}{N} \end{align} \]
[1] 0.3125
[1] 0 2 0 0 2 1 3 4 3 2 2 1 2 0 2 2 1 0 3 0 2 3 0 3 2 0 2 2 2 4
[1] 0 1 0 0 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 0 0 0 1 0 1 1 1 0
[1] 0.312888
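The expectation of an indicator is the probability of the corresponding event, so averaging simulated indicator values estimates that probability. A minimal sketch along these lines (the distribution, Binomial(5, 1/2), and the event \(\{X = 2\}\) are illustrative assumptions):
# P(X = 2) for X ~ Binomial(5, 1/2); the exact value is 5/16 = 0.3125
draws <- rbinom(n = 1e6, size = 5, prob = 0.5)
mean(draws == 2) # mean of the indicator I(draws == 2), close to 0.3125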
\[ \begin{eqnarray} E(g(X)) &=& \int_{-\infty}^{\infty} g(x) f_X(x) \, dx \\ E(g(X)) &=& \sum_{x} g(x) p_X(x) \end{eqnarray} \]
In R, we compute the sample variance with var() and the standard deviation with sd()
[1] 29.83
[1] 6.01
[1] 35.83
[1] 35.8
[1] -0.01
[1] 35.8
[1] 36
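A minimal sketch of these functions on simulated data (the mean of 30 and sd of 6 are hypothetical choices):
y <- rnorm(n = 1e5, mean = 30, sd = 6)
mean(y)               # close to 30
var(y)                # sample variance, close to 36
sd(y)                 # sample standard deviation, close to 6
mean(y^2) - mean(y)^2 # Var(Y) = E(Y^2) - E(Y)^2, also close to 36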
The Poisson PMF, \(\text{Poisson}(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}\), also sums to 1, since the sum is the Taylor series of \(e^{\lambda}\):
\[ \begin{eqnarray} \sum_{x=0}^{\infty} \frac{{\lambda^x e^{-\lambda}}}{{x!}} = e^{-\lambda} \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = e^{-\lambda} e^{\lambda} = 1 \end{eqnarray} \]
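We can check the Poisson normalization numerically as well (\(\lambda = 4\) is an arbitrary choice; the terms beyond x = 100 are negligible):
sum(dpois(x = 0:100, lambda = 4))
[1] 1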
For a continuous RV \(X\) with PDF \(f_X\), probabilities are areas under the density:
\[ \begin{eqnarray} P(a < X < b) & = & \int_{a}^{b} f_X(x)\, dx \\ F_X(x) & = & \int_{-\infty}^{x} f_X(u)\, du \end{eqnarray} \]
A Uniform RV \(X \sim \text{Uniform}(\alpha, \beta)\) has the following PDF:
\[ \text{Uniform}(x|\alpha,\beta) = \frac{1}{\beta - \alpha}, \, \text{where } \alpha \in \mathbb{R} \text{ and } \beta \in (\alpha,\infty) \]
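For example, the Uniform(-1, 1) density is \(\frac{1}{1 - (-1)} = \frac{1}{2}\) inside the support and 0 outside:
dunif(x = c(-2, 0, 0.7, 2), min = -1, max = 1)
[1] 0.0 0.5 0.5 0.0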
The expectation of a function of an RV: if \(Y = y(X)\), then
\[ \E(Y)= \int y(x) f_X(x)\, dx \]
For example, let \(X \sim \text{Uniform}(0, 1)\) and
\[ y(x) = \left\{ \begin{array}{ll} x & \text{if } 1/2 < x < 1, \text{ and} \\ 1 - x & \text{if } 0 < x < 1/2 \end{array} \right. \]
\[ \E(Y) = \int_{0}^{1} y(x) f_X(x)\, dx = \\ \int_{1/2}^{1} x \cdot 1 \, dx + \int_{0}^{1/2} (1-x) \cdot 1 \, dx = \\ \frac{x^2}{2} \Biggr|_{1/2}^{1} + \left( x - \frac{x^2}{2} \right) \Biggr|_{0}^{1/2} = \frac{3}{8} + \frac{3}{8} = \frac{3}{4} \]
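A quick simulation check of this result (a sketch, assuming \(X \sim \text{Uniform}(0, 1)\) as above and noting that \(y(x) = \max(x, 1 - x)\)):
x <- runif(1e6)
mean(pmax(x, 1 - x)) # close to 0.75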
\[ \text{Normal}(x \mid \mu,\sigma) = \frac{1}{\sqrt{2 \pi} \ \sigma} \exp\left( - \, \frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right) \! \]
The bell shape comes from the \(\exp(-x^2)\) part
The expected value is \(\E(X) = \mu\), the mode (highest peak) and median are also \(\mu\).
Variance is \(\V(X) = \sigma^2\) and standard deviation \(\text{sd} = \sigma\)
A Normal RV can be converted to a standard Normal by subtracting \(\mu\) and dividing by \(\sigma\): if \(X \sim \text{Normal}(\mu, \sigma)\), then \(Z = \frac{X - \mu}{\sigma} \sim \text{Normal}(0, 1)\)
\[ \text{Normal}(x \mid 0, 1) \ = \ \frac{1}{\sqrt{2 \pi}} \, \exp \left( \frac{-x^2}{2} \right)\ \]
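For example, using \(\mu = 69.1\) and \(\sigma = 2.9\) (the US men's height parameters used below; 72 inches is an arbitrary illustrative value), the probability is the same whether we standardize or not:
mu <- 69.1; sigma <- 2.9
pnorm(72, mean = mu, sd = sigma) # P(X <= 72)
[1] 0.8413447
pnorm((72 - mu) / sigma)         # same probability via the standard Normal CDF
[1] 0.8413447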
mu_m <- 69.1 # mean heights of US males in inches
sigma_m <- 2.9 # standard deviation for same
mu_w <- 63.7 # mean heights of US women in inches
sigma_w <- 2.7 # standard deviation of the same
x <- seq(50, 80, len = 1e3)
pdf_m <- dnorm(x, mean = mu_m, sd = sigma_m)
pdf_w <- dnorm(x, mean = mu_w, sd = sigma_w)
p <- ggplot(data.frame(x, pdf_m, pdf_w), aes(x, pdf_m))
p <- p + geom_line(size = 0.2, color = 'red') +
geom_line(aes(y = pdf_w), size = 0.2, color = 'blue') +
xlab("Height (in)") + ylab("") +
ggtitle("Distribution of heights of US adults") +
annotate("text", x = 73.5, y = 0.10, label = "Men", color = 'red') +
annotate("text", x = 58, y = 0.10, label = "Women", color = 'blue') +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())
print(p)
What is the probability that a randomly chosen man is taller than a randomly chosen woman?
We have two distributions \(M \sim \text{Normal}(\mu_m, \sigma_m)\) and \(W \sim \text{Normal}(\mu_w, \sigma_w)\). We want \(\P(M > W) = P(Z > 0),\, \text{where }Z = M - W\)
The sum or difference of two independent Normal RVs is Normal: the means add (or subtract), and the variances always add:
\[ \begin{eqnarray} Z & \sim & \text{Normal}\left(\mu_m - \mu_w,\, \sqrt{\sigma_m^2 + \sigma_w^2}\right) \\ E(Z) & = & E(M - W) = E(M) - E(W) = 69.1 - 63.7 = 5.4 \\ \text{Var}(Z) & = & \text{Var}(M - W) = \text{Var}(M) + \text{Var}(W) = 2.9^2 + 2.7^2 = 15.7 \\ \text{sd}(Z) & = & \sqrt{\text{Var}(Z)} = \sqrt{15.7} = 3.96 \\ Z & \sim & \text{Normal}(5.4,\, 3.96) \end{eqnarray} \]
The answer is \(\P(M > W) = \P(Z > 0) \approx 0.91\). Consequently, the probability that a randomly chosen woman is taller than a randomly chosen man is about \(1 - 0.91 = 0.09\)
How can we check the analytic solution? We can do a simulation!
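A minimal sketch of such a check (one million simulated pairs; the analytic value is included for comparison):
m <- rnorm(1e6, mean = 69.1, sd = 2.9)
w <- rnorm(1e6, mean = 63.7, sd = 2.7)
mean(m > w)                               # simulation estimate, close to 0.91
1 - pnorm(0, mean = 5.4, sd = sqrt(15.7)) # analytic answer, about 0.913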