SMaC: Statistics, Math, and Computing

Applied Statistics for Social Science Research

Eric Novik | Summer 2024 | Session 6

Session 6 Outline

  • Vectors and vector arithmetic
  • Matrices and matrix arithmetic
  • Determinants
  • Matrix inverses
  • Solving linear systems

\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\P}{\mathbb{P}} \DeclareMathOperator{\V}{\mathbb{V}} \DeclareMathOperator{\L}{\mathscr{L}} \DeclareMathOperator{\I}{\text{I}} \]

Introduction to Vectors

  • Linear Algebra offers a language for manipulating n-dimensional objects
  • Vectors are the basic building block of Linear Algebra
  • A straightforward way to think about them is as a column of numbers
  • How we interpret the entries is up to us
  • Vectors are typically written in column form:
\[\begin{aligned} V = \left[\begin{matrix}a\\b\\c\end{matrix}\right] \end{aligned}\]
  • If we want the row version, we transpose it:
\[\begin{aligned} V^T = \left[\begin{matrix}a & b & c\end{matrix}\right] \end{aligned}\]
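  • As a minimal sketch (using the placeholder values 1, 2, 3 for a, b, and c), we can make the column form explicit in R with a one-column matrix and get the row version with t():
V <- matrix(c(1, 2, 3), ncol = 1) # an explicit 3 x 1 column vector
t(V)                              # its transpose, a 1 x 3 row vector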

Introduction to Vectors

  • In R, vectors, unfortunately, are neither row nor column vectors
  • R makes some assumptions when performing vector-matrix arithmetic
  • We have already seen lots of vectors whenever we used the c() function or generated random numbers with, say, the runif() function
set.seed(123)
(v <- c(sample(3)))
[1] 3 1 2
class(v)
[1] "integer"
  • R reports that \(v\) is an integer vector

  • You can add two vectors in the usual way, elementwise:

\[\begin{aligned} \left[\begin{matrix}3\\1\\2\end{matrix}\right]+\left[\begin{matrix}2\\1\\3\end{matrix}\right]=\left[\begin{matrix}5\\2\\5\end{matrix}\right] \end{aligned}\]
  • In R:
v
[1] 3 1 2
w <- c(2, 1, 3)
w
[1] 2 1 3
v + w
[1] 5 2 5

Introduction to Vectors

  • You can take a linear combination \(av + bw\), where \(a\) and \(b\) are scalars
2*v + 3*w # linear combination
[1] 12  5 13
  • You can also take the dot product \(v \cdot w\), which results in a scalar
\[\begin{aligned} v \cdot w = \left[\begin{matrix}3 & 1 & 2\end{matrix}\right]\left[\begin{matrix}2\\1\\3\end{matrix}\right]=\left[\begin{matrix}13\end{matrix}\right] \end{aligned}\]
  • In R, we multiply vectors and matrices with %*%, not with *, which would produce component-wise multiplication, not a dot product
v %*% w # dot product
     [,1]
[1,]   13
v * w   # component wise multiplication
[1] 6 1 6
  • We write the dot product this way:

\[ v^T w = v_1w_1 + v_2w_2 + \cdots + v_nw_n \]
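  • As a quick sketch (reusing v and w from above), the dot product is just the sum of the elementwise products:
sum(v * w)    # 13, the same value as v %*% w, but returned as a plain number
drop(v %*% w) # drop() strips the 1 x 1 matrix down to a scalar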

Introduction to Vectors

  • When is the dot product zero?
v <- c(1, 1)
w <- c(1, -1)
v %*% w
     [,1]
[1,]    0
(3 * v) %*% (4 * w)
     [,1]
[1,]    0
  • These vectors are perpendicular (orthogonal)

Introduction to Vectors

  • The length of a vector is the square root of its dot product with itself

\[ ||v|| = \sqrt{v \cdot v} = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2} \]

v <- c(1, 1)  
sqrt(v %*% v) # this is sqrt of 2 
         [,1]
[1,] 1.414214
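  • Equivalently, a small sketch computing the same norm by squaring, summing, and taking the square root:
sqrt(sum(v^2))                 # same as sqrt(v %*% v), but returns a plain number
norm(as.matrix(v), type = "2") # base R's 2-norm of a one-column matrix gives the same result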

Introduction to Matrices

  • You can think of a matrix as a rectangular (or square) set of numbers
  • \(m{\times}n\) matrix has m rows and n columns
  • In statistics, it is convenient to think of a matrix as n columns, where each column is a variable, like the price of a house, and each row is an observation, like an individual house
  • In Linear Algebra books, you will often see linear equations written as \(Ax = b\), where we are trying to find \(x\)
  • In statistics, we usually write the same thing as \(X\beta = y\), and we are trying to find \(\beta\)
  • Another source of confusion: matrix \(A\) is sometimes called a coefficient matrix in Linear Algebra books. In statistics, the entries of \(A\) (the \(X\) matrix) contain observations, and the \(\beta\) vector contains the coefficients estimated from \(A\) and \(b\) (\(X\) and \(y\)).

Introduction to Matrices

  • A good way to think about matrices is that they act on vectors and transform them in some way (stretch or rotate them)
  • You can think of this transformation as encoding the eventual locations of the basis vectors
  • Standard basis vectors in \(R^2\) (two-dimensional space), are commonly called \(\hat{i}\) with location \((1, 0)\), and \(\hat{j}\) with location \((0, 1)\).
  • Notice their dot product is zero: \(\hat{i} \cdot \hat{j} = 0\) and so they are orthogonal
  • Let’s look at one simple transformation – rotation by 90 degrees counter-clockwise
  • Recall, vectors \(v\) and \(w\):

\[ v = \left[ \begin{matrix}1\\2\end{matrix} \right], \quad w = \left[ \begin{matrix}3\\1\end{matrix} \right] \]

  • The following matrix encodes a 90-degree, counterclockwise rotation. Why?

\[ R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \]

  • Let’s see how this matrix will act on vectors \(v\) and \(w\). That’s where multiplication comes in.

\[ Rv = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = 1 \begin{bmatrix} 0 \\ 1 \end{bmatrix} + 2 \begin{bmatrix} -1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} + \begin{bmatrix} -2 \\ 0 \end{bmatrix} = \begin{bmatrix} -2 \\ 1 \end{bmatrix} \]

  • We scale the first column of \(R\) by the first entry of \(v\), scale the second column by the second entry, and then add.
  • The following shows the same operation you can do in one step in R.
R <- matrix(c(0, 1, -1, 0), ncol = 2)
R
     [,1] [,2]
[1,]    0   -1
[2,]    1    0
v <- c(1, 2)
(v_prime <- R %*% v)
     [,1]
[1,]   -2
[2,]    1
  • Now let’s plot \(v\) and \(v'\) and notice the 90-degree rotation

Matrix Multiplication

  • Our matrix encodes a rotation of every single vector in \(R^2\).
  • In particular, it rotates every point \((x, y)\) in \(R^2\) by 90 degrees
  • We can do this transformation in one go by applying the Rotation matrix to another matrix comprising our vectors \(v\) and \(w\)
  • The number of columns in the Rotation matrix has to match the number of rows in the \(K = [v, w]\) matrix
(K <- matrix(c(v, w), ncol = 2))
     [,1] [,2]
[1,]    1    3
[2,]    2    1
(K_prime <- R %*% K)
     [,1] [,2]
[1,]   -2   -1
[2,]    1    3
  • The resulting matrix \(K'\) has the correctly rotated \(v\) in its first column and the rotated \(w\) in its second column.

  • Another way to think about this operation is to encode two transformations in \(K'\) — the \(K\) transformation followed by the \(R\) transformation. We can now use the resulting \(K'\) matrix and apply these two transformations in one swoop to any vector in \(R^2\).

  • Think about why matrix multiplication, in general, does not commute — \(RK \neq KR\) (a quick numerical check appears at the end of this slide)

  • Rotating red (\(v = [1 \ 2]\)) and blue (\(w = [3 \ 1]\)) vectors by 90 degrees

  • Your Turn: Come up with a 90-degree clockwise rotation matrix and show that it sends \((2, 2)\) to \((2, -2)\)

  • Now pick three vectors in \(R^2\) and rotate all three at the same time
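  • As noted above, matrix multiplication does not commute in general; a quick numerical check with the R and K matrices from this slide:
R %*% K                             # rotate the columns of K
K %*% R                             # a different matrix: R applied first, then K
isTRUE(all.equal(R %*% K, K %*% R)) # FALSE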

Linear Independence and Determinants

  • Think of a determinant as a (signed) area scaling factor from the basis \((\hat{i}, \hat{j})\) to the transformed vectors
  • The area of the unit square spanned by the two \(R^2\) basis vectors is 1
# this matrix scales the area by 12
(X <- matrix(c(4, 0, 0, 3), ncol = 2))
     [,1] [,2]
[1,]    4    0
[2,]    0    3
# compute the determinant
det(X)
[1] 12
  • What happens if the columns of \(X\) are linear combinations of each other?
  • Geometrically, that means that in \(R^2\), they both sit on the same line
# the second column is the first column * 1.5
(X <- matrix(c(2, 1, 1.5 * 2, 1.5 * 1), ncol = 2))
     [,1] [,2]
[1,]    2  3.0
[2,]    1  1.5
# compute the determinant
det(X)
[1] 0

Example: Random Matrix

  • Generate a random, say, 5x5 matrix
  • Is it likely that we will get linearly dependent columns?
set.seed(123)
(X <- matrix(sample(25), ncol = 5))
     [,1] [,2] [,3] [,4] [,5]
[1,]   15   18    9   25    2
[2,]   19   11   21   17   16
[3,]   14    5   24    1    7
[4,]    3   23   20   12    8
[5,]   10    6   22   13    4
det(X)
[1] 1155943
  • This is a full rank matrix that encodes a transformation in \(R^5\) — 5-dimensional space
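  • As a small check of that claim, base R's qr() reports the rank directly:
qr(X)$rank # 5, so all five columns are linearly independent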

Computing Determinants

  • Don’t bother doing it by hand
\[\begin{aligned} A = \left[\begin{matrix}a & c\\b & d\end{matrix}\right], \qquad \det(A) = a d - b c \end{aligned}\]
\[\begin{aligned} B = \left[\begin{matrix}a & d & g\\b & e & h\\c & f & i\end{matrix}\right], \qquad \det(B) = a e i - a f h - b d i + b f g + c d h - c e g \end{aligned}\]
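  • Still, it is worth checking the \(2 \times 2\) formula once against R's det(); a minimal sketch with arbitrary entries:
A <- matrix(c(2, 1, 5, 4), ncol = 2) # a = 2, b = 1, c = 5, d = 4
2 * 4 - 1 * 5                        # ad - bc = 3
det(A)                               # 3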

Linear Systems and Inverses

  • One of the key ideas from Linear Algebra that is relevant to statistics is solving linear systems
  • This is where estimating coefficients in \(X\beta = y\) comes from
  • If \(X\) is a square matrix, the problem is not statistical but algebraic
  • The reason is that if we have the same number of equations as unknowns, we can solve the system exactly, as long as \(X\) is full rank
  • For example, try to solve this system:

\[ 2x + y = 1 \\ 4x + 2y = 1 \]

  • Question: Geometrically, what does it mean to solve a system of equations?
  • You should now have a deeper understanding of why there is no solution
(A <- matrix(c(2, 4, 1, 2), ncol = 2))
     [,1] [,2]
[1,]    2    1
[2,]    4    2
det(A)
[1] 0
  • Before we talk about overdetermined systems (more equations than unknowns) that we see in statistics, let’s see how it works in the square matrix case
  • To see what a matrix inverse does, let’s bring back the 90-degree counterclockwise rotation matrix \(A\)
\[\begin{aligned} A = \left[\begin{matrix}0 & -1\\1 & 0\end{matrix}\right] \end{aligned}\]
  • The idea of an inverse is to reverse the action of this transformation
  • In this case, we need a clockwise 90-degree rotation matrix
  • We can find that matrix by directly tracing the basis vectors
\[\begin{aligned} A^{-1} = \left[\begin{matrix}0 & 1\\-1 & 0\end{matrix}\right] \end{aligned}\]
  • Let’s check our results in R. When you multiply a matrix by its inverse, you get back the identity matrix (just as multiplying a scalar by its reciprocal gives 1)
(A <- matrix(c(0, 1, -1, 0), 2, 2))
     [,1] [,2]
[1,]    0   -1
[2,]    1    0
(A_inv <- matrix(c(0, -1, 1, 0), 2, 2))
     [,1] [,2]
[1,]    0    1
[2,]   -1    0
A_inv %*% A
     [,1] [,2]
[1,]    1    0
[2,]    0    1
# find an inverse using R's solve function
solve(A)
     [,1] [,2]
[1,]    0    1
[2,]   -1    0
  • We are now ready to solve the square linear system \(Ax = b\), in the case when A is full rank

\[ \begin{eqnarray} Ax & = & b \\ A^{-1}Ax & = & A^{-1}b \\ x & = & A^{-1}b \end{eqnarray} \]

  • The general solution in the two-dimensional case:
\[\begin{aligned} \left[\begin{matrix}a & c\\b & d\end{matrix}\right]x = \left[\begin{matrix}b_{1}\\b_{2}\end{matrix}\right], \qquad \det(A) = a d - b c \end{aligned}\]
\[\begin{aligned} A^{-1} = \left[\begin{matrix}\frac{d}{a d - b c} & - \frac{c}{a d - b c}\\- \frac{b}{a d - b c} & \frac{a}{a d - b c}\end{matrix}\right], \qquad x = \left[\begin{matrix}\frac{b_{1} d - b_{2} c}{a d - b c}\\\frac{a b_{2} - b b_{1}}{a d - b c}\end{matrix}\right] \end{aligned}\]
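  • A small sketch checking this closed-form solution on one arbitrary \(2 \times 2\) system (the name x_sol is just a throwaway label):
A <- matrix(c(1, 2, 3, 4), ncol = 2)                        # a = 1, b = 2, c = 3, d = 4
x_sol <- c(5 * 4 - 6 * 3, 1 * 6 - 2 * 5) / (1 * 4 - 2 * 3)  # b_1 = 5, b_2 = 6
x_sol                                                       # -1  2
A %*% x_sol                                                 # recovers (5, 6), as it should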
  • In R, solve(A) inverts the matrix, and solve(A, b) solves \(Ax = b\).
(A <- matrix(sample(9), ncol = 3))
     [,1] [,2] [,3]
[1,]    3    1    7
[2,]    4    9    2
[3,]    6    5    8
(b <- c(1, 2, 3))
[1] 1 2 3
solve(A, b) %>% round(2)
[1]  0.93 -0.14 -0.24
# same as above
solve(A) %*% b %>% round(2)
      [,1]
[1,]  0.93
[2,] -0.14
[3,] -0.24

Your Turn

  • Consider the following system of equations:

\[ \begin{eqnarray} 3x + 2y + 1.5z & = & 4 \\ 7x + y & = & 2 \\ 3y + 2z & = & 1 \end{eqnarray} \]

  • Express it in matrix form, solve it in R using the solve() function, and validate that the results are correct

Some Useful Rules

If you put on socks and then shoes, the first to be taken off are the ____

— Gilbert Strang (on the order in which the inverse undoes the transformations)

\[ (ABC \dots)^{-1} = \dots C^{-1}B^{-1}A^{-1} \\ (A^T)^{-1} = (A^{-1})^T \\ (A + B)^T = A^T + B^T \\ (ABC \dots)^T = \dots C^T B^T A^T \]
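  • A quick numerical check of the first and last rules, a minimal sketch with two small invertible matrices:
A <- matrix(c(1, 2, 3, 4), 2, 2)
B <- matrix(c(0, 1, -1, 0), 2, 2)                # the 90-degree rotation from earlier
all.equal(solve(A %*% B), solve(B) %*% solve(A)) # TRUE: (AB)^(-1) = B^(-1) A^(-1)
all.equal(t(A %*% B), t(B) %*% t(A))             # TRUE: (AB)^T = B^T A^T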

Solving n > p Systems

  • Frequentist statistical inference is concerned with “solving” overdetermined systems
  • We have lots of observations with relatively few variables. (Sometimes, we have more variables than observations, but we will not discuss it here)
  • Clearly, there is no exact solution
  • In fact, there are typically infinitely many ways in which you can draw a line through a cloud of points or a plane through n-dimensional space
  • Optimization-based (or frequentist) inference is concerned with finding the most likely values of the unknowns giving rise to that line (or plane) and approximating their variance
  • Integration-based (or Bayesian) inference is concerned with finding a joint PDF of the unknowns – all plausible values weighted by their probability. Therefore, the “estimated variance” is a consequence of this PDF. (Recall our example of estimating variance from the distribution of the difference in heights.) In this sense, Bayesian inference is an uncertainty-preserving system, a very desirable quality.
  • For simple linear models, both of these methods produce similar results.

Solving an Overdetermined \(X\hat{\beta} = y\)

  • We are switching to a more familiar (to statisticians) notation of \(X \beta = y\) (instead of \(Ax = b\))
  • We typically augment the \(X\) matrix with a column of 1s on the left to model the intercept term, sometimes called \(\beta_0\)
  • This should make sense when you think about \(X \beta\) as the linear combination of columns of \(X\)
  • In this form, \(X\) is usually called the design matrix or the model matrix
  • The problem is that \(X\) is not invertible (it is not square), even if the columns of \(X\) are linearly independent
  • Let’s say that \(X\) is a \(3 \times 2\) matrix (3 rows and 2 columns), and columns are linearly independent. Note that \(\beta\) is \(2 \times 1\) and \(y\) is \(3 \times 1\).
  • \(X\) and its linear combinations \(X\beta\) span a plane in three-dimensional space – there is no way for it to reach the target vector \(y\) in \(R^3\)
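  • One way to see this in R, a small sketch with made-up numbers (X_ex and y_ex are throwaway names): if \(y\) has a component outside the column space of \(X\), appending it as a third column raises the rank from 2 to 3
X_ex <- matrix(c(1, 0, 0, 0, 1, 0), ncol = 2) # columns span the xy-plane in R^3
y_ex <- c(1, 2, 3)                            # its third coordinate sticks out of that plane
qr(X_ex)$rank                                 # 2
qr(cbind(X_ex, y_ex))$rank                    # 3, so y_ex is not in the column space of X_ex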

Approximating a Solution to Overdetermined \(X\hat{\beta} = y\)

  • If we cannot reach \(y\), we need to find a vector in the column space of \(X\) that is closest (in a certain sense) to \(y\)

  • The projection of \(y\) onto this plane gives us the answer

Solving an Overdetermined \(X\hat{\beta} = y\)

  • This residual \(r\) is also called the error term
  • This error, \(y - X\hat{\beta}\), is the smallest when it’s perpendicular to the plane
  • Another way of saying that is:

\[ X^T(y - X\hat{\beta}) = 0 \]

  • We can now solve this normal equation:

\[ \begin{eqnarray} X^T(y - X\hat{\beta}) & = & 0 \\ X^TX\hat{\beta} & = & X^Ty \\ (X^TX)^{-1}(X^TX)\hat{\beta} & = & (X^TX)^{-1}X^Ty \\ \hat{\beta} & = & (X^TX)^{-1}X^Ty \end{eqnarray} \]

  • Matrix \(X^TX\) is square and symmetric. It is also invertible if the columns of \(X\) are linearly independent:
(X <- matrix(sample(6), ncol = 2))
     [,1] [,2]
[1,]    2    1
[2,]    5    4
[3,]    3    6
(XtX <- t(X) %*% X)
     [,1] [,2]
[1,]   38   40
[2,]   40   53
solve(XtX) %>% round(2)
      [,1]  [,2]
[1,]  0.13 -0.10
[2,] -0.10  0.09
(X <- matrix(c(1, 2, 3, 2, 4, 6), ncol = 2))
     [,1] [,2]
[1,]    1    2
[2,]    2    4
[3,]    3    6
(XtX <- t(X) %*% X)  # will not invert
     [,1] [,2]
[1,]   14   28
[2,]   28   56
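  • As a quick confirmation that this \(X^TX\) is singular: its determinant is zero (14 * 56 - 28 * 28 = 0), and solve() refuses to invert it:
det(XtX)        # 0
try(solve(XtX)) # errors because XtX is singular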

Motion in a Straight Line

  • Recall, from lecture 1, our example of a car moving in a straight line where we have noisy measurements of its position over time
  • At the time, we assumed we were able somehow to find the intercept and slope of the equation
  • We are now in a position to solve the problem

Motion in a Straight Line

  • Position here is a quadratic function of time. How can we fit it using linear regression (the least squares method)?

  • We assumed the equation of motion was \(x(t) = a + bt^2\), which we will write as \(y(t) = \beta_0 + \beta_1 t^2\)

  • The unknowns are \(\beta_0\) and \(\beta_1\)

  • Let’s express it in the matrix notation

  • What is our design matrix \(X\)? We have 51 observations, so \(X\) will have two columns and 51 rows: a column of 1s for the intercept and a column of times squared \(t^2\)

  • Our unknown vector \(\hat{\beta}\) has two elements \((\hat{\beta_0},\, \hat{\beta_1})\)

  • \(y\) is our outcome vector of length 51, capturing the car’s position at each time point \(t\)

  • \(y = X\hat{\beta}\) captures our system in matrix form

# length of t (our time vector)
length(t)
[1] 51
# first few elements of t
head(t)
[1] 0.0 0.1 0.2 0.3 0.4 0.5
# construct the design matrix
X <- matrix(c(rep(1, length(t)), t^2), ncol = 2)
head(X)
     [,1] [,2]
[1,]    1 0.00
[2,]    1 0.01
[3,]    1 0.04
[4,]    1 0.09
[5,]    1 0.16
[6,]    1 0.25
# inspect the vector y
length(y)
[1] 51
head(y)
[1] -2.4417028  6.7615084 -0.7502334 -0.4900157 -3.5129263  1.9331119
  • Recall, that the least squares estimate of \(\beta\) is given by \(\hat{\beta} = (X^TX)^{-1}X^Ty\)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat %>% round(2)
     [,1]
[1,] 1.46
[2,] 3.01
data <- data.frame(y = y, t_squared = t^2)
# compare to R's lm()
fit <- lm(y ~ t_squared, data = data)
arm::display(fit)
lm(formula = y ~ t_squared, data = data)
            coef.est coef.se
(Intercept) 1.46     0.54   
t_squared   3.01     0.05   
---
n = 51, k = 2
residual sd = 2.60, R-Squared = 0.99
  • Warning: do not use \((X^TX)^{-1}X^Ty\) directly as above; it is computationally inefficient and can be numerically unstable (see the sketch at the end of this slide for a better approach)
# p is the scatterplot of position vs. time from earlier
d <- data.frame(yhat = beta_hat[1] + beta_hat[2] * t^2)
p + geom_line(aes(y = yhat), data = d, size = 0.2, color = 'red')
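  • A more stable alternative, as a sketch: avoid forming the inverse explicitly and let R solve the normal equations or the least-squares problem directly (lm() itself works through a QR decomposition):
beta_hat2 <- solve(crossprod(X), crossprod(X, y)) # solves (X^T X) beta = X^T y without inverting
beta_hat3 <- qr.solve(X, y)                       # least-squares solution via the QR decomposition of X
round(cbind(beta_hat2, beta_hat3), 2)             # both match beta_hat above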

Your Turn

  • Look at the dataset mtcars
  • Plot mpg against wt, treating mpg as the output variable \(y\). (We generally don’t like to use the words dependent vs. independent in this context)
  • Using the normal equations, find the intercept and slope of the best-fit line
  • Validate that it matches the output from lm
  • Now plot the line over the data and see if it “fits”