We begin with the normal linear model. We assume that \(y_i \sim \mathcal{N}(\mu_i, \sigma^2)\), where \(\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}\).
This is equivalent to writing \(y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i\) with \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\). The two forms are interchangeable: the first emphasizes the probability model, while the second looks like the regression equation you know from econometrics.
If we create a column vector of coefficients \(\beta = [\beta_0, \beta_1, \ldots, \beta_k]^\top\) and create a matrix \(X\) by stacking a column of ones and the predictors side-by-side,
then the mean for observation \(i\) is \(\mu_i = X_i \beta\) and the entire vector of means is \(\mu = X\beta\). This matrix \(X\) is called the design matrix.
We can then write the model as \(y_i \sim \mathcal{N}(\mu_i, \sigma^2)\), where \(\mu_i = X_i\beta\).
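To make the notation concrete, here is a small sketch in R with simulated data; the predictors and the values of \(\beta\) and \(\sigma\) are made up for illustration.

```r
# Build a small design matrix by hand and compute mu = X beta.
set.seed(1)
n <- 5
x1 <- rnorm(n)
x2 <- rnorm(n)

X <- cbind(1, x1, x2)    # column of ones, then the predictors, side by side
beta <- c(2, 0.5, -1)    # hypothetical beta0, beta1, beta2
mu <- X %*% beta         # the entire vector of means at once

sigma <- 1                            # hypothetical residual standard deviation
y <- rnorm(n, mean = mu, sd = sigma)  # y_i ~ Normal(mu_i, sigma^2)
```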
9.2 Scalar and Matrix Notation
9.2.1 A Single Observation
Consider a simple regression with two predictors. For observation \(i\), the scalar form is
\[
\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}.
\]
In matrix form, the same observation has the row \(X_i = [1 \;\; x_{i1} \;\; x_{i2}]\), so that \(X_i\beta = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}\).
9.2.2 All \(n\) Observations
Stacking the \(n\) rows gives
\[
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}
= \begin{bmatrix}
1 & x_{1,1} & x_{1,2} \\
1 & x_{2,1} & x_{2,2} \\
\vdots & \vdots & \vdots \\
1 & x_{n,1} & x_{n,2}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
= X\beta.
\]
Each row \(X_i\) corresponds to the predictors for observation \(i\), and \(X_i\beta\) is equivalent to the scalar regression equation for observation \(i\).
9.2.3 General Form with \(k\) Predictors
The same idea extends to \(k\) predictors. The \(i\)th row of \(X\) is
\[
X_i = [1 \;\; x_{i1} \;\; x_{i2} \;\; \cdots \;\; x_{ik}],
\]
so that
\[
X_i\beta = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik},
\]
which is exactly the scalar regression equation. Stacking the \(n\) rows gives the design matrix \(X\) of dimension \(n \times (k+1)\). There is nothing new here; the matrix notation simply collects all the scalar equations into one compact expression.
9.2.4 The Intercept
The intercept corresponds to the first column of ones in the design matrix. For ten observations with two predictors we can write
\[
X = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} \\
1 & x_{2,1} & x_{2,2} \\
\vdots & \vdots & \vdots \\
1 & x_{10,1} & x_{10,2}
\end{bmatrix}.
\]
The column of ones multiplies \(\beta_0\), so every observation's mean includes the intercept. The other coefficients describe how \(\mu_i\) responds to the predictors.
Differentiating \(\mu_i\) with respect to \(x_{i1}\) gives \(\frac{\partial \mu_i}{\partial x_{i1}} = \beta_1\).
Differentiating \(\mu_i\) with respect to \(x_{i2}\) gives \(\frac{\partial \mu_i}{\partial x_{i2}} = \beta_2\).
Thus, a one-unit increase in \(x_{ij}\) changes \(\mu_i\) by \(\beta_j\), holding the other predictor constant.
On “effects” in regression models
When we say that \(\frac{\partial \mu_i}{\partial x_{ij}} = \beta_j\), we are describing the behavior of the statistical model. When \(x_{ij}\) changes by one unit, the model shifts \(\mu_i\) by \(\beta_j\). This does not mean that changing \(x_{ij}\) in the real world necessarily causes \(\mu_i\) to shift by \(\beta_j\).
9.3 Interactions
We can extend the regression model by including a product term \(x_{i1}x_{i2}\) in the model:
\[
\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}.
\]
Differentiating \(\mu_i\) with respect to \(x_{i1}\) gives \(\frac{\partial \mu_i}{\partial x_{i1}} = \beta_1 + \beta_3 x_{i2}\). Thus, the effect of \(x_{i1}\) depends on the value of \(x_{i2}\).
Similarly, \(\frac{\partial \mu_i}{\partial x_{i2}} = \beta_2 + \beta_3 x_{i1}\). Thus the effect of \(x_{i2}\) depends on the value of \(x_{i1}\).
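A quick numeric illustration, with made-up coefficient values: if \(\beta_1 = 0.5\) and \(\beta_3 = -0.2\), the slope of \(x_{i1}\) is \(0.5\) when \(x_{i2} = 0\) but only \(0.1\) when \(x_{i2} = 2\).

```r
# The effect of x1 evaluated at several values of x2 (hypothetical betas).
b1 <- 0.5
b3 <- -0.2
x2 <- c(0, 1, 2)
b1 + b3 * x2   # 0.5, 0.3, 0.1
```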
The interaction appears in the design matrix as an additional column that is the product of two other columns. For two predictors and ten observations the design matrix might be
\[
X = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & x_{1,1}x_{1,2} \\
1 & x_{2,1} & x_{2,2} & x_{2,1}x_{2,2} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{10,1} & x_{10,2} & x_{10,1}x_{10,2}
\end{bmatrix}.
\]
When fitting the model, the product term is treated just like any other predictor; it is just another column in the design matrix.
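As a sketch of how this looks in R, model.matrix() constructs the product column from a formula; the data here are toy values.

```r
# The x1:x2 term adds one column: the elementwise product of x1 and x2.
d <- data.frame(x1 = c(1, 2, 3), x2 = c(10, 20, 30))
model.matrix(~ x1 + x2 + x1:x2, data = d)
# The x1:x2 column is 10, 40, 90, i.e., d$x1 * d$x2.
```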
9.4 Polynomials
We can extend the regression model by including polynomial terms. For this example, let’s include \(x_{i1}\) as a cubic polynomial (quadratic or higher-order polynomials work as you would expect). The model would be
\[
\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i1}^2 + \beta_3 x_{i1}^3.
\]
The squared and cubic terms are simply new columns in the design matrix. When fitting the model, these columns are treated just like any other predictor; they are just more columns in the design matrix.
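A sketch in R with toy values: wrapping the powers in I() tells the formula interface to compute them literally, and each power becomes one more column.

```r
# Squared and cubed terms are just additional columns of the design matrix.
d <- data.frame(x1 = c(1, 2, 3, 4))
model.matrix(~ x1 + I(x1^2) + I(x1^3), data = d)
# The row for x1 = 3 is: 1, 3, 9, 27.
```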
9.5 Indicator Variables
We can include binary indicator variables in the regression model. Suppose \(x_{i1}\) and \(x_{i2}\) are numeric variables and \(x_{i3}\) is a binary indicator variable that equals 0 or 1. Then we would have the usual regression model
\[
\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}.
\]
However, the interpretation is interesting: the indicator variable works like a switch.
When the switch is off (\(x_{i3}=0\)), \(\mu_i\) does not include \(\beta_3\).
When the switch is on (\(x_{i3}=1\)), \(\mu_i\) does include \(\beta_3\).
If \(x_{i3}=0\), then \(\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}\).
If \(x_{i3}=1\), then \(\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3\quad \leftarrow \text{(notice the extra bit on the end!)}\).
The coefficient \(\beta_3\) is the shift in \(\mu_i\) when the indicator variable equals 1 rather than 0.
The indicator appears in the design matrix as an additional column of zeros and ones. For two numeric predictors, one indicator, and ten observations the design matrix might be
\[
X = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & 0 \\
1 & x_{2,1} & x_{2,2} & 1 \\
1 & x_{3,1} & x_{3,2} & 1 \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{10,1} & x_{10,2} & 0
\end{bmatrix},
\]
where the last column holds the zeros and ones of \(x_{i3}\).
The indicator variable is treated just like any other predictor; it is just another column in the design matrix.
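A sketch in R with made-up values: a 0/1 column passes into the design matrix unchanged.

```r
# The binary indicator x3 is carried into X as-is.
d <- data.frame(
  x1 = c(1.2, 0.7, 2.1, 1.5),
  x2 = c(3.0, 2.2, 4.1, 3.3),
  x3 = c(0, 1, 1, 0)  # indicator: 0 = switch off, 1 = switch on
)
model.matrix(~ x1 + x2 + x3, data = d)
```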
9.6 Categorical Variables with More Than Two Categories
We can include categorical variables with more than two categories by creating multiple indicator variables. Suppose that \(x_{i1}\) and \(x_{i2}\) are numeric variables and our categorical variable has three categories, labeled \(A\), \(B\), and \(C\). We might define \(x_{i3}=1\) if the observation is in category \(B\) and \(0\) otherwise, and \(x_{i4}=1\) if the observation is in category \(C\) and \(0\) otherwise. The model would be
\[
\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4},
\]
where category \(A\) is the baseline, with both indicators equal to zero.
If the observation is in category \(A\) (\(x_{i3}=0, x_{i4}=0\)), then \(\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}\).
If the observation is in category \(B\) (\(x_{i3}=1, x_{i4}=0\)), then \(\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3\).
If the observation is in category \(C\) (\(x_{i3}=0, x_{i4}=1\)), then \(\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_4\).
The coefficient \(\beta_3\) is the shift in \(\mu_i\) from category \(A\) to category \(B\). The coefficient \(\beta_4\) is the shift in \(\mu_i\) from category \(A\) to category \(C\).
As before, these indicators appear in the design matrix as additional columns. For two numeric predictors, one categorical variable with three categories, and ten observations the design matrix might be
\[
X = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & 0 & 0 \\
1 & x_{2,1} & x_{2,2} & 1 & 0 \\
1 & x_{3,1} & x_{3,2} & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & x_{10,1} & x_{10,2} & 0 & 0
\end{bmatrix},
\]
where the first observation is in category \(A\), the second in category \(B\), the third in category \(C\), and so on.
A categorical variable with \(m\) categories requires \(m-1\) indicator variables in the model. One category is omitted and serves as the baseline. If we included all \(m\) indicators along with the intercept, the columns of the design matrix would be perfectly collinear. Omitting one category avoids this problem and ensures the model is identified.
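A sketch in R with toy data: given a three-level factor, model.matrix() creates the two indicator columns automatically, dropping the first level as the baseline, which is exactly how the collinearity problem above is avoided.

```r
# A factor with levels A, B, C expands into the columns catB and catC.
d <- data.frame(
  x1  = c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0),
  x2  = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5),
  cat = factor(c("A", "B", "C", "A", "B", "C"))
)
model.matrix(~ x1 + x2 + cat, data = d)
# catB = 1 for rows in category B; catC = 1 for rows in category C;
# rows in category A have zeros in both indicator columns.
```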
9.7 Real Data
We can build a design matrix from the penguins data in the {palmerpenguins} package, which contains data on penguins from several islands in Antarctica.
Suppose we want to model body mass as a function of flipper length and bill length. We can use a normal regression and model the mean \(\mu_i\) as
\[
\mu_i = \beta_0 + \beta_1 \cdot \text{flipper length}_i + \beta_2 \cdot \text{bill length}_i.
\]
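One way to build this design matrix in R is with model.matrix(); the sketch below first drops rows with missing values, since the penguins data contain some NAs.

```r
library(palmerpenguins)

# Keep the variables we need and drop incomplete rows.
d <- na.omit(penguins[, c("body_mass_g", "flipper_length_mm", "bill_length_mm")])

# Intercept column of ones, flipper length, bill length.
X <- model.matrix(~ flipper_length_mm + bill_length_mm, data = d)
head(X)
```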
The first column of ones corresponds to the intercept. The second column is flipper length, and the third column is bill length. Each row corresponds to one penguin.
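For completeness, lm() builds the same design matrix internally from the formula; a minimal fit might look like this (shown only to connect the design matrix to a fitted model, not as the fitting approach developed here).

```r
# lm() constructs X from the formula and estimates beta.
fit <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins)
coef(fit)  # intercept, flipper length slope, bill length slope
```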