6  Fisher Information Matrix

We can easily use the log-likelihood function to obtain point estimates. It turns out, though, that the same log-likelihood function contains information that we can use to estimate the precision of those estimates as well.

As an example, consider the following two log-likelihood functions:

Which of these two log-likelihood functions do you think provides a more precise estimate?

Note: These likelihoods are from a normal model with unknown mean. I simulated 100 observations for \(y_1\) and 300 observations for \(y_2\). (I centered the data so the sample means both occurred exactly at 2).

Key Idea: We can use the curvature around the maximum likelihood estimate to get a sense of the uncertainty.

What quantity tells us about the amount of curvature at the maximum? The second derivative. As the second derivative becomes more negative (its magnitude increases), the curvature goes up. As the curvature goes up, the uncertainty goes down.

6.1 Example: Stylized Normal

In this example, we examine curvature for a single-parameter model.

To develop our intuition about “curvature” and confidence intervals, I analyze the Stylized Normal Model (\(\sigma = 1\)). Here, we model the data as a normal distribution with \(\mu\) unknown (and to be estimated), but \(\sigma = 1\) (known; not estimated). That is, \(y \sim N(\mu, 1)\).

\[ \begin{aligned} \ell(\mu) &= -\tfrac{N}{2}\log(2\pi) - \tfrac{1}{2}\sum_{i = 1}^N (y_i - \mu)^2\\ \dfrac{\partial \ell(\mu)}{\partial \mu} &= \sum_{i = 1}^N (y_i - \mu) = \sum y_i - N\mu\\ \dfrac{\partial^2 \ell(\mu)}{\partial \mu^2} &= - N \end{aligned} \]

Facts:

  • As \(N\) increases, \(\dfrac{\partial^2 \ell(\mu \mid y)}{\partial \mu^2}\) becomes more negative.
  • As the magnitude of \(\dfrac{\partial^2 \ell(\mu \mid y)}{\partial \mu^2}\) increases, the curvature increases.
  • As the curvature increases, the uncertainty decreases.
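As a quick numerical check of these facts, here is a minimal sketch (an illustration under assumptions: it uses the numDeriv package, which these notes do not otherwise use, plus simulated samples matching the sizes in the figure above). The numerical second derivative at the sample mean comes out to about \(-N\), whatever the particular draws.

# load package for numerical derivatives
library(numDeriv)

# log-likelihood for the stylized normal model (sigma = 1)
stylized_ll <- function(mu, y) {
  sum(dnorm(y, mean = mu, sd = 1, log = TRUE))
}

# simulate two data sets like those in the figure
set.seed(1234)
y1 <- rnorm(100, mean = 2, sd = 1)
y2 <- rnorm(300, mean = 2, sd = 1)

# numerical second derivative, evaluated at the sample mean
hessian(stylized_ll, x = mean(y1), y = y1)  # approximately -100
hessian(stylized_ll, x = mean(y2), y = y2)  # approximately -300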

Wouldn’t it be really nice if we could use \(\dfrac{\partial^2 \ell(\mu)}{\partial \mu^2}\) to estimate the standard error?

It turns out that we can: this second derivative leads directly to a simple estimator of the standard error.

6.2 Theory

Definition 6.1 (Fisher Information) For a model \(f(y \mid \theta)\) with log-likelihood \(\ell(\theta) = \sum_{i=1}^N \log f(y_i \mid \theta)\), the expected information is defined as \(\mathcal I(\theta) = -\mathbb E_\theta\!\left[\nabla^2 \ell(\theta)\right]\).

In practice, we often use the observed information, given by the negative Hessian of the log-likelihood at the ML estimate \(\mathcal I_{\text{obs}}(\hat\theta) = -\nabla^2 \ell(\hat\theta)\).

There’s an important theoretical distinction between the expected and observed information. The expected Fisher information \(\mathcal I(\theta) = \mathbb E_\theta[-\nabla^2 \ell(\theta)]\) is the expected curvature of the log-likelihood under the model. In contrast, the observed information \(\mathcal I_{\text{obs}}(\hat\theta) = -\nabla^2 \ell(\hat\theta)\) uses the curvature at the ML estimate \(\hat{\theta}\) from a particular sample. Under standard regularity conditions, both lead to the same large-sample variance: \(\mathcal I_{\text{obs}}(\hat\theta) \approx \mathcal I(\theta)\) in large samples, so \(\widehat{\operatorname{Var}}(\hat\theta) \approx \mathcal I_{\text{obs}}(\hat\theta)^{-1} \approx \mathcal I(\theta)^{-1}\). In many examples, they are numerically identical. In some weird cases, the expected information is “more robust.” In these notes, I use the observed information because it makes the link between curvature and uncertainty concrete and intuitive.

Recall that \(\nabla\) denotes the gradient vector of first derivatives of the log-likelihood with respect to the parameters,
\[ \nabla \ell(\theta) \;=\; \begin{bmatrix} \dfrac{\partial \ell}{\partial \theta_1} \\ \dfrac{\partial \ell}{\partial \theta_2} \\ \vdots \\ \dfrac{\partial \ell}{\partial \theta_k} \end{bmatrix}. \]
The Hessian \(\nabla^2\) is the square matrix of second derivatives,
\[\small \nabla^2 \ell(\theta) \;=\; \begin{bmatrix} \dfrac{\partial^2 \ell}{\partial \theta_1^2} & \dfrac{\partial^2 \ell}{\partial \theta_1 \partial \theta_2} & \cdots & \dfrac{\partial^2 \ell}{\partial \theta_1 \partial \theta_k} \\ \dfrac{\partial^2 \ell}{\partial \theta_2 \partial \theta_1} & \dfrac{\partial^2 \ell}{\partial \theta_2^2} & \cdots & \dfrac{\partial^2 \ell}{\partial \theta_2 \partial \theta_k} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 \ell}{\partial \theta_k \partial \theta_1} & \dfrac{\partial^2 \ell}{\partial \theta_k \partial \theta_2} & \cdots & \dfrac{\partial^2 \ell}{\partial \theta_k^2} \end{bmatrix}. \] These gradient and Hessian objects are the matrix calculus tools needed to define Fisher information.
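As a quick illustration of these objects, here is a sketch that computes a gradient and a Hessian numerically with the numDeriv package (the toy two-parameter function below is made up for illustration; it is not one of this chapter's models).

# load package for numerical derivatives
library(numDeriv)

# a made-up, smooth two-parameter function standing in for a log-likelihood
toy_fn <- function(theta) {
  -(theta[1]^2 + theta[1] * theta[2] + 2 * theta[2]^2)
}

grad(toy_fn, x = c(1, 1))     # gradient: a length-2 vector of first derivatives
hessian(toy_fn, x = c(1, 1))  # Hessian: a 2 x 2 symmetric matrix of second derivatives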

Theorem 6.1 (Asymptotic Normality of ML Estimators) Let \(y_1,\dots,y_N\) be iid from a model \(f(y \mid \theta)\) with true parameter \(\theta\). Suppose the regularity conditions in Theorem 1.1 hold and the Fisher information matrix \(\mathcal I(\theta)\) is finite and positive definite. Then the maximum likelihood estimator \(\hat\theta\) is asymptotically normal \(\hat\theta \overset{a}{\sim} \mathcal N\big(\theta, \mathcal I(\theta)^{-1}\big)\).

This is an important result, because it tells us the location (i.e., \(\theta\)), variability (i.e., \(\mathcal I(\theta)^{-1}\)), and shape (i.e., normal) of the sampling distribution asymptotically. These asymptotic results tend to be good approximations in practice.
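As a small check on how well the approximation works, here is a simulation sketch for the stylized normal model (the choices of 1,000 replications, \(N = 100\), and \(\mu = 2\) are assumptions for illustration). In this model \(\hat\mu\) is the sample mean and \(\mathcal I(\mu)^{-1} = 1/N\), so the replicated estimates should look roughly normal with standard deviation near \(\sqrt{1/N} = 0.1\).

# simulate the sampling distribution of the ML estimate (the sample mean)
set.seed(1234)
N <- 100
mu_hat <- replicate(1000, mean(rnorm(N, mean = 2, sd = 1)))

sd(mu_hat)    # close to sqrt(1/N) = 0.1
hist(mu_hat)  # roughly bell-shaped, centered near 2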

Theorem 6.2 (Asymptotic Variance of ML Estimators) Under the conditions of Theorem 6.1, the covariance of the maximum likelihood estimator is well-approximated by the inverse Fisher information \(\operatorname{Var}(\hat\theta) \;\approx\; \mathcal I(\theta)^{-1}\).

In practice, we replace the unknown \(\theta\) with the maximum likelihood estimate and use the observed information \(\widehat{\operatorname{Var}}(\hat\theta) \;\approx\; \big[-\nabla^2 \ell(\hat\theta)\big]^{-1}\).

For a single parameter \(\theta\), this reduces to \(\widehat{\operatorname{Var}}(\hat\theta) \;\approx\; \left[\,-\,\dfrac{\partial^2 \ell(\theta)}{\partial \theta^2}\Bigg|_{\theta=\hat\theta}\right]^{-1}\).

Approximations via asymptotics

We should be careful here. The results above are asymptotic. As the sample size grows, the variance of \(\hat{\theta}\) eventually converges to \(\left[\left. - \dfrac{\partial^2 \ell(\theta)}{\partial \theta^2}\right| _{\theta = \hat{\theta}}\right] ^{-1}\). However, “eventually gets close” does not imply “is close” for a finite sample size. In practice, though, we are usually safe to treat asymptotic results as good approximations.

This means that we can estimate the SE as follows:

  1. Find the second derivative (or Hessian if multiple parameters) of the log-likelihood.
  2. Evaluate the second derivative at the maximum (\(\theta = \hat{\theta}\)).
  3. Find the inverse. (That’s an estimate of the variance.)
  4. Take the square root.

\[ \widehat{\text{SE}}(\hat{\theta}) = \sqrt{\left[\left. - \dfrac{\partial^2 \ell(\theta)}{\partial \theta^2}\right| _{\theta = \hat{\theta}}\right] ^{-1}} \]

If we continue the stylized normal example, we have the following.

\[ \begin{equation*} \dfrac{\partial^2 \ell(\mu \mid y)}{\partial \mu^2} = - N ~{\color{purple}{\Longrightarrow}}~ \left[\left. - \dfrac{\partial^2 \ell(\mu \mid y)}{\partial \mu^2}\right| _{\mu = \hat{\mu}}\right] ^{-1} = \dfrac{1}{N} \approx \widehat{\operatorname{Var}}(\hat{\mu}) \end{equation*} \]

And then

\[ \begin{equation*} \widehat{\text{SE}}(\hat{\mu}) \approx \sqrt{\dfrac{1}{N}} \end{equation*} \]

Does this answer make sense? What is the classic standard error of the mean when taking a random sample from a population? Hint: It’s \(\text{SE}[\operatorname{avg}(y)] = \sigma/\sqrt{N}\). In this case, the “population SD” is \(\sigma = 1\), as assumed by the stylized normal model.
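As a numerical check of the four-step recipe above, here is a minimal sketch of a hypothetical helper, est_one_par() (not part of the code used elsewhere in these notes), that works for any one-parameter log-likelihood: it finds the maximum with optimize() and gets the second derivative numerically from the numDeriv package. Applied to simulated stylized-normal data with \(N = 100\), the standard error should come out near \(1/\sqrt{N} = 0.1\).

# load package for numerical derivatives
library(numDeriv)

# hypothetical helper: ML estimate and SE for a one-parameter log-likelihood
est_one_par <- function(ll_fn, y, interval) {
  # find the ML estimate numerically
  opt <- optimize(ll_fn, interval = interval, y = y, maximum = TRUE)
  theta_hat <- opt$maximum
  
  # steps 1 and 2: second derivative of the log-likelihood at the maximum
  d2 <- hessian(ll_fn, x = theta_hat, y = y)
  
  # steps 3 and 4: invert the negative second derivative, then take the square root
  se_hat <- as.numeric(sqrt(1 / (-d2)))
  
  list(theta_hat = theta_hat, se_hat = se_hat)
}

# stylized normal log-likelihood (sigma = 1) and simulated data
stylized_ll <- function(mu, y) sum(dnorm(y, mean = mu, sd = 1, log = TRUE))
set.seed(1234)
y <- rnorm(100, mean = 2, sd = 1)

est_one_par(stylized_ll, y, interval = c(-10, 10))  # se_hat should be near 0.1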

6.3 Curvature in Multiple Dimensions

To add multiple dimensions, let’s consider the beta model. Our goal is to estimate \(\alpha\) and \(\beta\). The key is that we have multiple (i.e., two) parameters to estimate.

It’s a bit trickier to think about curvature in multiple dimensions. Instead of a single second derivative like we have in one dimension, we have a Hessian matrix of second derivatives in multiple dimensions.

Here’s what the log-likelihood function might look like for a given data set. This is a raster and contour plot. The contour lines connect parameter combinations with the same log-likelihood. The color shows the log-likelihood: larger log-likelihoods are red and smaller log-likelihoods are blue.

To make more sense of this, here’s a version you can rotate.

The curvature around the maximum vertically tells us the variance in \(\hat{\beta}\).

The curvature around the maximum horizontally tells us the variance in \(\hat{\alpha}\).

But there’s a third direction that’s relevant here: the curvature diagonally. The diagonal curvature tells us the covariance of \(\hat{\alpha}\) and \(\hat{\beta}\). That is, if we over-estimate \(\alpha\), how much do we tend to over-estimate (or under-estimate) \(\beta\)? (This reference line is illustrative; in practice, we read the covariance from the inverse information matrix.)

Rather than a single variance, we get a variance matrix (sometimes called the “covariance matrix” or the “variance–covariance matrix”). The diagonal entries are variances; the off-diagonal entries are covariances.

\[ \begin{equation*} \widehat{\operatorname{Var}}(\hat{\theta})= \widehat{\text{Cov}}(\hat{\theta}) \approx \left. \left[ \displaystyle \begin{matrix} - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_1^2} & - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_1 \partial \theta_2}\\ - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_2 \partial \theta_1} & - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_2^2}\\ \end{matrix}\right]^{-1} \right|_{\theta = \hat{\theta}} \end{equation*} \]

The elements along the diagonal (in red) are the variances for each parameter, so the square root of the diagonal gives you the standard errors. This is exactly what we’d expect.

\[ \begin{equation*} \widehat{\operatorname{Var}}(\hat{\theta}) \approx \left. \left[ \displaystyle \begin{matrix} \color{red}{- \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_1^2}} & - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_1 \partial \theta_2}\\ - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_2 \partial \theta_1} & \color{red}{- \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_2^2}}\\ \end{matrix}\right]^{-1} \right|_{\theta = \hat{\theta}} \end{equation*} \]

The off-diagonal elements (in blue) are the covariances—they’ll be really important to us later, but we don’t have a direct use for them at the moment.

\[ \begin{equation*} \widehat{\operatorname{Var}}(\hat{\theta}) \approx \left. \left[ \displaystyle \begin{matrix} - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_1^2} & \color{blue}{- \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_1 \partial \theta_2}}\\ \color{blue}{- \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_2 \partial \theta_1}} & - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_2^2}\\ \end{matrix}\right]^{-1} \right|_{\theta = \hat{\theta}} \end{equation*} \]
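To make the bookkeeping concrete, here is a purely illustrative sketch: the observed information matrix below is made up (it does not come from any model in this chapter), but it shows how we read variances, standard errors, and the covariance off the inverted matrix with solve().

# a made-up observed information matrix (negative Hessian at the maximum)
info_obs <- matrix(c(50, 20,
                     20, 30), nrow = 2, byrow = TRUE)

# invert to get the estimated covariance matrix
var_hat <- solve(info_obs)

sqrt(diag(var_hat))  # standard errors for theta_1-hat and theta_2-hat
var_hat[1, 2]        # estimated covariance between theta_1-hat and theta_2-hat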

6.4 More than Two Parameters

But what about more than two parameters? It’s exactly what you’d expect.

\[ \begin{equation*} \begin{aligned} \widehat{\operatorname{Var}}(\hat{\theta}) &\approx \left. \left[ \displaystyle \begin{matrix} - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_1^2} & - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_1 \partial \theta_2} & \ldots &- \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_1 \partial \theta_k}\\ - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_2 \partial \theta_1} & - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_2^2} & \ldots & - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_2 \partial \theta_k}\\ \vdots & \vdots & \ddots & \vdots \\ - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_k \partial \theta_1} & - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_k \partial \theta_2} & \ldots & - \dfrac{\partial^2 \ell(\theta\mid y)}{\partial \theta_k^2}\\ \end{matrix}\right]^{-1} \right|_{\theta = \hat{\theta}}\\ & \approx \mathcal{I}_{\text{obs}}(\theta)^{-1}\big|_{\theta = \hat{\theta}}\\ &\approx \mathcal{I}_{\text{obs}}(\hat{\theta})^{-1} \end{aligned} \end{equation*} \]

6.5 From Curvature to Wald Confidence Intervals

As the sample size grows large, the sampling distribution of the ML estimate converges to a normal distribution with mean \(\theta\) and variance \(\mathcal{I}(\theta)^{-1}\).

In practice, we’ll take this to mean it’s approximately normal, which justifies the usual Wald procedure of creating a 90% confidence interval by hopping 1.64 SEs to the left and right of the estimate.

Recall that \(\widehat{\text{SE}}(\hat{\theta}) = \sqrt{\left[\left. - \dfrac{\partial^2 \ell(\theta)}{\partial \theta^2}\right| _{\theta = \hat{\theta}}\right]^{-1}}\).

\[ \begin{align*} 90\%~\text{C.I.} &= \hat{\theta} \pm 1.64 \cdot \widehat{\text{SE}}(\hat{\theta})\\ 95\%~\text{C.I.} &= \hat{\theta} \pm 1.96 \cdot \widehat{\text{SE}}(\hat{\theta}) \end{align*} \]

To work with these intervals, then, we just need the variance matrix \(\widehat{\operatorname{Var}}(\hat{\theta}) = \mathcal{I}_{\text{obs}}(\hat{\theta})^{-1}\).
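Here is a minimal sketch of a hypothetical helper, wald_ci() (not part of the code used elsewhere in these notes), that builds Wald intervals from a vector of estimates and the matrix \(\widehat{\operatorname{Var}}(\hat{\theta})\); qnorm() supplies the multiplier (about 1.64 for 90% and 1.96 for 95%).

# hypothetical helper: Wald confidence intervals from estimates and a variance matrix
wald_ci <- function(theta_hat, var_hat, level = 0.90) {
  se_hat <- sqrt(diag(var_hat))    # standard errors from the diagonal
  z <- qnorm(1 - (1 - level) / 2)  # about 1.64 for a 90% interval
  cbind(lower = theta_hat - z * se_hat,
        upper = theta_hat + z * se_hat)
}

With the beta model fit later in this chapter, wald_ci(fit$theta_hat, fit$var_hat) would reproduce the 90% intervals computed at the end of the chapter (up to the difference between the rounded 1.64 and qnorm(0.95)).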

6.6 Example: Exponential Model

In this example, we compute the standard error for the rate parameter in the exponential model. The exponential pdf is \(f(y_i \mid \lambda) = \lambda \exp(-\lambda y_i)\) for \(y_i \ge 0\) and the ML estimate is \(\hat{\lambda} = \dfrac{1}{\operatorname{avg}(y)}\) (from earlier).

The log-likelihood is

\[ \begin{aligned} \ell(\lambda) &= \sum_{i=1}^N \log \left[ \lambda \exp(-\lambda y_i) \right]\\ &= \sum_{i=1}^N [\log \lambda - \lambda y_i]\\ &= N \log \lambda - \lambda \sum_{i=1}^N y_i. \end{aligned} \]

The first derivative of the log-likelihood (called the “score function”) is \(\dfrac{\partial \ell(\lambda)}{\partial \lambda} = \dfrac{N}{\lambda} - \sum_{i=1}^N y_i\).

The second derivative is \(\dfrac{\partial^2 \ell(\lambda)}{\partial \lambda^2} = -\dfrac{N}{\lambda^2}\).

Set the first derivative equal to zero to obtain the ML estimate \(\hat{\lambda} = \dfrac{N}{\sum_{i=1}^N y_i} = \dfrac{1}{\operatorname{avg}(y)}\).

Evaluate the second derivative at the maximum to get the observed information

\[ \mathcal I_{\text{obs}}(\hat{\lambda}) = \left.-\dfrac{\partial^2 \ell(\lambda)}{\partial \lambda^2}\right|_{\lambda = \hat{\lambda}} = \dfrac{N}{\hat{\lambda}^2}. \]

Invert to estimate the variance \(\widehat{\operatorname{Var}}(\hat{\lambda}) \approx \dfrac{\hat{\lambda}^2}{N}\). Take the square root to get the standard error \(\widehat{\text{SE}}(\hat{\lambda}) \approx \dfrac{\hat{\lambda}}{\sqrt{N}}\).

And just as before, we use these SE estimates to create confidence intervals.

\[ \begin{aligned} 90\%~\text{C.I.} &= \hat{\lambda} \pm 1.64 \dfrac{\hat{\lambda}}{\sqrt{N}},\\ 95\%~\text{C.I.} &= \hat{\lambda} \pm 1.96 \dfrac{\hat{\lambda}}{\sqrt{N}}. \end{aligned} \]

Comment: Notice that the second derivative \((-N/\lambda^2)\) becomes more negative as \(N\) grows, so the curvature at the maximum increases and the uncertainty in \(\hat{\lambda}\) shrinks, exactly as we would expect.
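To see these formulas in action, here is a small sketch with simulated data (the sample size \(N = 200\) and the true rate of 0.5 are assumptions for illustration).

# simulate exponential data
set.seed(1234)
N <- 200
y <- rexp(N, rate = 0.5)

# ML estimate, standard error, and 90% Wald interval for lambda
lambda_hat <- 1 / mean(y)
se_hat     <- lambda_hat / sqrt(N)
lambda_hat + 1.64 * c(-1, 1) * se_hat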

6.7 Beta model and optim()

Just like optim() can use a numerical hill-climbing algorithm to find and return the par that maximizes fn, it can also use a numerical algorithm to find and return the hessian.

6.7.1 The logic of optim()’s hessian

At the maximum likelihood estimate, the first derivative is zero. That means we are sitting right on top of the hill, where the slope is flat. To find the Hessian, we need to know how quickly the slope changes as we move away from the maximum. That “rate of change of the slope” is exactly the second derivative. In multiple dimensions, we call this matrix of second derivatives the Hessian.

How does optim() find this Hessian? It does not derive it with calculus. Instead, it uses numerical differences. Starting at the maximum, the first derivative is zero. optim() nudges each parameter slightly up and down. After each nudge, it checks how the slope changes in response. The pattern of those changes (i.e., the changes in the slopes) is the Hessian. optim() is just poking the surface around the maximum and carefully watching how the flat spot bends.
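To demystify the “nudging,” here is a sketch of the central-difference idea for a single parameter, applied to the exponential model from Section 6.6 (the step size h and the simulated data are assumptions for illustration). optim() uses a finite-difference scheme in the same spirit, nudging every parameter and every pair of parameters.

# exponential log-likelihood and simulated data (for illustration)
set.seed(1234)
N <- 200
y <- rexp(N, rate = 0.5)
exp_ll <- function(lambda, y) sum(dexp(y, rate = lambda, log = TRUE))

# central-difference approximation to the second derivative at the maximum
lambda_hat <- 1 / mean(y)
h <- 1e-4
(exp_ll(lambda_hat + h, y) - 2 * exp_ll(lambda_hat, y) + exp_ll(lambda_hat - h, y)) / h^2

# compare to the analytic second derivative from Section 6.6
-N / lambda_hat^2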

6.7.2 Lahman’s batting_average data

To see this in action, we can use data on batting average in Major League Baseball that we saw in Section 1.5 and Section 3.2.

# load packages
library(tidyverse)
library(Lahman)  # data from Lahman's baseball database

# create data frame with batting average
bstats <- battingStats() |> 
  filter(yearID == 2023) |>  # data from 2023
  filter(AB >= 100) |>  # players with at least 100 at-bats
  select(player_id = playerID, batting_average = BA) |>
  arrange(-batting_average) |>
  na.omit() |>
  glimpse()
Rows: 457
Columns: 2
$ player_id       <chr> "arraelu01", "acunaro01", "freemfr01", "diazya01", "se…
$ batting_average <dbl> 0.354, 0.337, 0.331, 0.330, 0.327, 0.319, 0.316, 0.312…
# plot histogram
hist(bstats$batting_average)
Figure 6.1: A histogram of batting average in Lahman’s data. Data from 2023; includes players with at least 100 at-bats.

6.7.3 Computing hessian with optim()

To model these data with a beta distribution and estimate the covariance matrix, we can modify the R code from before.

# log-likelihood function (using dbeta!)
beta_ll_fn <- function(theta, y) { 
  alpha <- theta[1] 
  beta  <- theta[2] 
  ll <- sum(dbeta(y, shape1 = alpha, shape2 = beta, log = TRUE))  # (1)
  return(ll)
}

# function to fit beta model 
est_beta <- function(y) {
  # use optim; compute hessian
  est <- optim(
    par     = c(2, 2),  # decent starting values for the problem below
    fn      = beta_ll_fn,
    y       = y,
    control = list(fnscale = -1),  # (2)
    method  = "BFGS",
    hessian = TRUE  # (3)
  ) 
  
  # compute an estimate of the covariance matrix (step by step, this first time)
  info_obs <- -est$hessian  # (4) notice the negative sign
  var_hat  <- solve(info_obs)  # (5)
  
  # check convergence; print warning if needed
  if (est$convergence != 0) print("Model did not converge!")
  
  # return list of elements
  res <- list(theta_hat = est$par,  # (6)
              var_hat   = var_hat)
  return(res)
}
1. Compute the log-likelihood. dbeta(..., log = TRUE) returns \(\log f(y_i \mid \alpha, \beta)\). Summing produces \(\ell(\alpha,\beta) \;=\; \sum_{i=1}^N \log f(y_i \mid \alpha,\beta)\). We work with the log-likelihood because (1) logging turns numerically unstable products into stable sums and (2) the Hessian of the log-likelihood is useful for estimating the variance.
2. Use optim() to maximize \(\ell\). fnscale = -1 makes optim() minimize \(-\ell(\theta)\), which is equivalent to maximizing \(\ell(\theta)\).
3. Ask for the Hessian. hessian = TRUE tells optim() to compute a numeric Hessian at the maximum \(\hat\theta = (\hat\alpha,\hat\beta)\) by nudging the function near the maximum and measuring how the slope changes in response to these small nudges. Conceptually, this captures the curvature of the log-likelihood surface at the peak. Computing this Hessian is computationally costly, so the argument defaults to FALSE. Here’s an important detail: the Hessian returned is for fn; fnscale only changes the optimization target, not the reported Hessian. With fnscale = -1, optim() still returns the Hessian of fn (the log-likelihood), so the observed information is -est$hessian.
4. Compute the observed information at the ML estimate. est$hessian is the Hessian of \(\ell\) at \(\hat\theta\) (negative definite at the maximum), so \(\texttt{info\_obs} = -\texttt{est\$hessian} = -\nabla^2 \ell(\hat\theta) = \mathcal I_{\text{obs}}(\hat\theta)\). This is the \(2\times2\) matrix

\[ \mathcal I_{\text{obs}}(\hat\theta) = \begin{bmatrix} -\dfrac{\partial^2 \ell}{\partial \alpha^2} & -\dfrac{\partial^2 \ell}{\partial \alpha\,\partial \beta}\\[6pt] -\dfrac{\partial^2 \ell}{\partial \beta\,\partial \alpha} & -\dfrac{\partial^2 \ell}{\partial \beta^2} \end{bmatrix}_{\theta=\hat\theta}. \]

5. Invert the observed information to get the variance. Asymptotic theory tells us that \(\widehat{\operatorname{Var}}(\hat\theta) \approx \mathcal I_{\text{obs}}(\hat\theta)^{-1}\). So solve(info_obs) returns the estimated covariance matrix for \((\hat\alpha,\hat\beta)\). The diagonal entries are \(\widehat{\operatorname{Var}}(\hat\alpha)\) and \(\widehat{\operatorname{Var}}(\hat\beta)\). The square roots are the standard errors. The off-diagonal entry is \(\widehat{\text{Cov}}(\hat\alpha,\hat\beta)\).

6. Return var_hat. We now have two items to return() after fitting the model: the parameter estimate vector theta_hat and the estimated covariance matrix var_hat.

6.7.4 Fitting the beta model

We can now use our function to fit the model and print the parameter estimates and covariance matrix.

# estimate beta model using the batting average data
fit <- est_beta(bstats$batting_average)
fit$theta_hat  # parameter estimates
[1]  37.07655 114.92550
fit$var_hat  # covariance matrix estimates
          [,1]     [,2]
[1,]  5.964783 18.40870
[2,] 18.408705 57.83667
sqrt(diag(fit$var_hat))  # standard error estimates 
[1] 2.442291 7.605042
fit$theta_hat[1] + 1.64*c(-1, 1)*sqrt(diag(fit$var_hat))[1] # 90% ci for alpha
[1] 33.07119 41.08191
fit$theta_hat[2] + 1.64*c(-1, 1)*sqrt(diag(fit$var_hat))[2] # 90% ci for beta
[1] 102.4532 127.3978