We begin with the normal linear model. We assume that \(y_i \sim \mathcal{N}(\mu_i, \sigma^2)\), where \(\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}\).
This is equivalent to writing \(y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i\) with \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\). The two forms are interchangeable: the first emphasizes the probability model, while the second looks like the regression equation you know from econometrics.
If we create a column vector of coefficients \(\beta = [\beta_0, \beta_1, \ldots, \beta_k]^\top\) and create a matrix \(X\) by stacking a column of ones and the predictors side-by-side,
then the mean for observation \(i\) is \(\mu_i = X_i \beta\) and the entire vector of means is \(\mu = X\beta\). This matrix \(X\) is called the design matrix.
We can then write the model as \(y_i \sim \mathcal{N}(\mu_i, \sigma^2)\), where \(\mu_i = X_i\beta\).
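To make the notation concrete, here is a small sketch in R with simulated data; the predictors and the values of \(\beta\) and \(\sigma\) are made up for illustration.

```r
# Build a small design matrix by hand and compute mu = X beta.
set.seed(1)
n <- 5
x1 <- rnorm(n)
x2 <- rnorm(n)

X <- cbind(1, x1, x2)    # column of ones, then the predictors, side by side
beta <- c(2, 0.5, -1)    # hypothetical beta0, beta1, beta2
mu <- X %*% beta         # the entire vector of means at once

sigma <- 1                            # hypothetical residual standard deviation
y <- rnorm(n, mean = mu, sd = sigma)  # y_i ~ Normal(mu_i, sigma^2)
```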
9.2 Scalar and Matrix Notation
9.2.1 A Single Observation
Consider a simple regression with two predictors. For observation \(i\), the scalar form is
\[
\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}.
\]
In matrix form, the same observation has the row \(X_i = [1 \;\; x_{i1} \;\; x_{i2}]\), so that \(X_i\beta = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}\).
9.2.2 All \(n\) Observations
Stacking the \(n\) rows gives
\[
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}
= \begin{bmatrix}
1 & x_{1,1} & x_{1,2} \\
1 & x_{2,1} & x_{2,2} \\
\vdots & \vdots & \vdots \\
1 & x_{n,1} & x_{n,2}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
= X\beta.
\]
Each row \(X_i\) corresponds to the predictors for observation \(i\), and \(X_i\beta\) is equivalent to the scalar regression equation for observation \(i\).
9.2.3 General Form with \(k\) Predictors
The same idea extends to \(k\) predictors. The \(i\)th row of \(X\) is
\[
X_i = [1 \;\; x_{i1} \;\; x_{i2} \;\; \cdots \;\; x_{ik}],
\]
so that
\[
X_i\beta = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik},
\]
which is exactly the scalar regression equation. Stacking the \(n\) rows gives the design matrix \(X\) of dimension \(n \times (k+1)\). There is nothing new here; the matrix notation simply collects all the scalar equations into one compact expression.
9.2.4 The Intercept
The intercept corresponds to the first column of ones in the design matrix. For ten observations with two predictors we can write
\[
X = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} \\
1 & x_{2,1} & x_{2,2} \\
\vdots & \vdots & \vdots \\
1 & x_{10,1} & x_{10,2}
\end{bmatrix}.
\]
The column of ones multiplies \(\beta_0\), so every observation's mean includes the intercept. The other coefficients describe how \(\mu_i\) responds to the predictors.
Differentiating \(\mu_i\) with respect to \(x_{i1}\) gives \(\frac{\partial \mu_i}{\partial x_{i1}} = \beta_1\).
Differentiating \(\mu_i\) with respect to \(x_{i2}\) gives \(\frac{\partial \mu_i}{\partial x_{i2}} = \beta_2\).
Thus, a one-unit increase in \(x_{ij}\) changes \(\mu_i\) by \(\beta_j\), holding the other predictor constant.
On “effects” in regression models
When we say that \(\frac{\partial \mu_i}{\partial x_{ij}} = \beta_j\), we are describing the behavior of the statistical model. When \(x_{ij}\) changes by one unit, the model shifts \(\mu_i\) by \(\beta_j\). This does not mean that changing \(x_{ij}\) in the real world necessarily causes \(\mu_i\) to shift by \(\beta_j\).
9.3 Interactions
We can extend the regression model by including a product term \(x_{i1}x_{i2}\) in the model:
\[
\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}.
\]
Differentiating \(\mu_i\) with respect to \(x_{i1}\) gives \(\frac{\partial \mu_i}{\partial x_{i1}} = \beta_1 + \beta_3 x_{i2}\). Thus, the effect of \(x_{i1}\) depends on the value of \(x_{i2}\).
Similarly, \(\frac{\partial \mu_i}{\partial x_{i2}} = \beta_2 + \beta_3 x_{i1}\). Thus the effect of \(x_{i2}\) depends on the value of \(x_{i1}\).
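A quick numeric illustration, with made-up coefficient values: if \(\beta_1 = 0.5\) and \(\beta_3 = -0.2\), the slope of \(x_{i1}\) is \(0.5\) when \(x_{i2} = 0\) but only \(0.1\) when \(x_{i2} = 2\).

```r
# The effect of x1 evaluated at several values of x2 (hypothetical betas).
b1 <- 0.5
b3 <- -0.2
x2 <- c(0, 1, 2)
b1 + b3 * x2   # 0.5, 0.3, 0.1
```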
The interaction appears in the design matrix as an additional column that is the product of two other columns. For two predictors and ten observations the design matrix might be
\[
X = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & x_{1,1}x_{1,2} \\
1 & x_{2,1} & x_{2,2} & x_{2,1}x_{2,2} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{10,1} & x_{10,2} & x_{10,1}x_{10,2}
\end{bmatrix}.
\]
When fitting the model, the product term is treated just like any other predictor; it is just another column in the design matrix.
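As a sketch of how this looks in R, model.matrix() constructs the product column from a formula; the data here are toy values.

```r
# The x1:x2 term adds one column: the elementwise product of x1 and x2.
d <- data.frame(x1 = c(1, 2, 3), x2 = c(10, 20, 30))
model.matrix(~ x1 + x2 + x1:x2, data = d)
# The x1:x2 column is 10, 40, 90, i.e., d$x1 * d$x2.
```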
9.4 Polynomials
We can extend the regression model by including polynomial terms. For this example, let’s include \(x_{i1}\) as a cubic polynomial (quadratic or higher-order polynomials work as you would expect). The model would be
\[
\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i1}^2 + \beta_3 x_{i1}^3.
\]
The squared and cubic terms are simply new columns in the design matrix. When fitting the model, these columns are treated just like any other predictor; they are just more columns in the design matrix.
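A sketch in R with toy values: wrapping the powers in I() tells the formula interface to compute them literally, and each power becomes one more column.

```r
# Squared and cubed terms are just additional columns of the design matrix.
d <- data.frame(x1 = c(1, 2, 3, 4))
model.matrix(~ x1 + I(x1^2) + I(x1^3), data = d)
# The row for x1 = 3 is: 1, 3, 9, 27.
```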
9.5 Indicator Variables
We can include binary indicator variables in the regression model. Suppose \(x_{i1}\) and \(x_{i2}\) are numeric variables and \(x_{i3}\) is a binary indicator variable that equals 0 or 1. Then we would have the usual regression model
\[
\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}.
\]
However, the interpretation is interesting: the indicator variable works like a switch.
When the switch is off (\(x_{i3}=0\)), \(\mu_i\) does not include \(\beta_3\).
When the switch is on (\(x_{i3}=1\)), \(\mu_i\) does include \(\beta_3\).
If \(x_{i3}=0\), then \(\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}\).
If \(x_{i3}=1\), then \(\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3\quad \leftarrow \text{(notice the extra bit on the end!)}\).
The coefficient \(\beta_3\) is the shift in \(\mu_i\) when the indicator variable equals 1 rather than 0.
The indicator appears in the design matrix as an additional column of zeros and ones. For two numeric predictors, one indicator, and ten observations the design matrix might be
\[
X = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & 0 \\
1 & x_{2,1} & x_{2,2} & 1 \\
1 & x_{3,1} & x_{3,2} & 1 \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{10,1} & x_{10,2} & 0
\end{bmatrix},
\]
where the last column holds the zeros and ones of \(x_{i3}\).
The indicator variable is treated just like any other predictor; it is just another column in the design matrix.
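A sketch in R with made-up values: a 0/1 column passes into the design matrix unchanged.

```r
# The binary indicator x3 is carried into X as-is.
d <- data.frame(
  x1 = c(1.2, 0.7, 2.1, 1.5),
  x2 = c(3.0, 2.2, 4.1, 3.3),
  x3 = c(0, 1, 1, 0)  # indicator: 0 = switch off, 1 = switch on
)
model.matrix(~ x1 + x2 + x3, data = d)
```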
9.6 Categorical Variables with More Than Two Categories
We can include categorical variables with more than two categories by creating multiple indicator variables. Suppose that \(x_{i1}\) and \(x_{i2}\) are numeric variables and our categorical variable has three categories, labeled \(A\), \(B\), and \(C\). We might define \(x_{i3}=1\) if the observation is in category \(B\) and \(0\) otherwise, and \(x_{i4}=1\) if the observation is in category \(C\) and \(0\) otherwise. The model would be
\[
\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4},
\]
where category \(A\) is the baseline, with both indicators equal to zero.
If the observation is in category \(A\) (\(x_{i3}=0, x_{i4}=0\)), then \(\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}\).
If the observation is in category \(B\) (\(x_{i3}=1, x_{i4}=0\)), then \(\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3\).
If the observation is in category \(C\) (\(x_{i3}=0, x_{i4}=1\)), then \(\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_4\).
The coefficient \(\beta_3\) is the shift in \(\mu_i\) from category \(A\) to category \(B\). The coefficient \(\beta_4\) is the shift in \(\mu_i\) from category \(A\) to category \(C\).
As before, these indicators appear in the design matrix as additional columns. For two numeric predictors, one categorical variable with three categories, and ten observations the design matrix might be
\[
X = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & 0 & 0 \\
1 & x_{2,1} & x_{2,2} & 1 & 0 \\
1 & x_{3,1} & x_{3,2} & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & x_{10,1} & x_{10,2} & 0 & 0
\end{bmatrix},
\]
where the first observation is in category \(A\), the second in category \(B\), the third in category \(C\), and so on.
A categorical variable with \(m\) categories requires \(m-1\) indicator variables in the model. One category is omitted and serves as the baseline. If we included all \(m\) indicators along with the intercept, the columns of the design matrix would be perfectly collinear. Omitting one category avoids this problem and ensures the model is identified.
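A sketch in R with toy data: given a three-level factor, model.matrix() creates the two indicator columns automatically, dropping the first level as the baseline, which is exactly how the collinearity problem above is avoided.

```r
# A factor with levels A, B, C expands into the columns catB and catC.
d <- data.frame(
  x1  = c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0),
  x2  = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5),
  cat = factor(c("A", "B", "C", "A", "B", "C"))
)
model.matrix(~ x1 + x2 + cat, data = d)
# catB = 1 for rows in category B; catC = 1 for rows in category C;
# rows in category A have zeros in both indicator columns.
```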
9.7 Real Data
We can build a design matrix from the penguins data in the {palmerpenguins} package, which contains data on penguins from several islands in Antarctica.
Suppose we want to model body mass as a function of flipper length and bill length. We can use a normal regression and model the mean \(\mu_i\) as
\[
\mu_i = \beta_0 + \beta_1 \cdot \text{flipper length}_i + \beta_2 \cdot \text{bill length}_i.
\]
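One way to build this design matrix in R is with model.matrix(); the sketch below first drops rows with missing values, since the penguins data contain some NAs.

```r
library(palmerpenguins)

# Keep the variables we need and drop incomplete rows.
d <- na.omit(penguins[, c("body_mass_g", "flipper_length_mm", "bill_length_mm")])

# Intercept column of ones, flipper length, bill length.
X <- model.matrix(~ flipper_length_mm + bill_length_mm, data = d)
head(X)
```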
The first column of ones corresponds to the intercept. The second column is flipper length, and the third column is bill length. Each row corresponds to one penguin.
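For completeness, lm() builds the same design matrix internally from the formula; a minimal fit might look like this (shown only to connect the design matrix to a fitted model, not as the fitting approach developed here).

```r
# lm() constructs X from the formula and estimates beta.
fit <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins)
coef(fit)  # intercept, flipper length slope, bill length slope
```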