7  Delta Method

The invariance property tells us that if \(\hat{\theta}\) is the ML estimate of \(\theta\), then \(\hat\tau = \tau(\hat{\theta})\) is the ML estimate of \(\tau(\theta)\). But this raises a new question: how do we obtain the standard error of \(\hat\tau\)? We have an estimated SE for \(\hat{\theta}\); how do we convert it into an estimated SE for \(\hat\tau = \tau(\hat{\theta})\)?

Key Idea: The delta method uses a Taylor expansion to approximate the variance of \(\hat\tau = \tau(\hat{\theta})\) using the variance of \(\hat{\theta}\).

A Taylor expansion uses derivatives to approximate a function near a particular point. For a one-variable function \(\tau(\theta)\) expanded around \(\hat{\theta}\),

\[ \small \tau(\theta) \;\approx\; \tau(\hat{\theta}) + \tau'(\hat{\theta})(\theta - \hat{\theta}) + \tfrac{1}{2}\tau''(\hat{\theta})(\theta - \hat{\theta})^2 + \cdots \]

When \(\theta\) is close to \(\hat{\theta}\), the higher-order terms are small. Dropping them gives the linear approximation \[ \small \tau(\theta) \;\approx\; \tau(\hat{\theta}) + \tau'(\hat{\theta})(\theta - \hat{\theta}). \]

Thus, near \(\hat{\theta}\), small deviations are scaled by the slope \(\tau'(\hat{\theta})\). Applying this scaling to the sampling variability of \(\hat{\theta}\) gives the delta method \[ \small \operatorname{Var}\!\big(\tau(\hat{\theta})\big) \;\approx\; \big(\tau'(\hat{\theta})\big)^2 \, \operatorname{Var}(\hat{\theta}). \] The square comes from the rule \[ \small \operatorname{Var}(cX) = c^2 \operatorname{Var}(X). \]

Intuitively, you can think of a Taylor expansion as “zooming in” until the curve looks straight. Near \(\theta\), small wiggles in \(\hat{\theta}\) are expanded or shrunk by \(\tau(\cdot)\) depending on the slope. The delta method adjusts for this shrinking or expanding.

In the one-parameter case, \(\widehat{\operatorname{Var}}[\tau(\hat{\theta})] \approx \Big(\tau'(\hat{\theta})\Big)^2 \cdot \widehat{\operatorname{Var}}(\hat{\theta})\).
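
To see this scaling in action, here is a quick simulation sketch. The transformation \(\tau(\theta) = e^\theta\), the "true" value 1.5, and the SE of 0.05 are illustrative assumptions, not tied to any model in this chapter: we draw many estimates around the true \(\theta\), transform them, and compare the simulated variance of \(\tau(\hat{\theta})\) to the delta-method approximation.

# simulation sketch of the one-parameter delta method
# (tau(theta) = exp(theta) and the values below are illustrative choices)
set.seed(1234)
theta    <- 1.5    # "true" parameter value (assumed)
se_theta <- 0.05   # SD of the sampling distribution of theta_hat (assumed)

theta_hat <- rnorm(100000, mean = theta, sd = se_theta)  # simulated estimates
tau_hat   <- exp(theta_hat)                              # transformed estimates

var(tau_hat)               # simulated variance of tau(theta_hat)
exp(theta)^2 * se_theta^2  # delta method: (tau'(theta))^2 * Var(theta_hat)

The two numbers should be close, and the approximation improves as the SE of \(\hat{\theta}\) shrinks, because the linear approximation only needs to hold over a narrower range.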

In the multi-parameter case, if \(\hat{\theta}\) is a vector of estimates and \(\tau(\cdot)\) is a scalar-valued function of those parameters, then \[ \widehat{\operatorname{Var}}(\hat\tau) = \widehat{\operatorname{Var}}[\tau(\hat{\theta})] \approx \nabla \tau(\hat{\theta})^\top \cdot \widehat{\operatorname{Var}}(\hat{\theta}) \cdot \nabla \tau(\hat{\theta}), \] where \(\nabla \tau(\hat{\theta})\) is the gradient of \(\tau(\theta)\) with respect to \(\theta\), evaluated at \(\hat\theta\).

From there, we can create Wald confidence intervals in the usual way. Recall that \(\widehat{\text{SE}}(\hat{\tau}) = \sqrt{\widehat{\operatorname{Var}}(\hat\tau)}\).

\[ \begin{align*} 90\%~\text{C.I.} &= \hat{\tau} \pm 1.64 \cdot \widehat{\text{SE}}(\hat{\tau})\\ 95\%~\text{C.I.} &= \hat{\tau} \pm 1.96 \cdot \widehat{\text{SE}}(\hat{\tau}) \end{align*} \]

7.1 Bernoulli: From \(\pi\) to odds

Suppose a Bernoulli model of a binary outcome \(y\). The ML estimate of \(\pi\) is \(\hat{\pi} = \text{avg}(y)\). Using the Fisher information, the variance of \(\hat{\pi}\) is \(\widehat{\operatorname{Var}}(\hat{\pi}) = \dfrac{\hat{\pi}(1 - \hat{\pi})}{N}\).

Suppose we want to transform \(\pi\) to the odds, where \(\text{odds} = \tau(\pi) = \dfrac{\pi}{1 - \pi}\). We use the invariance property to obtain the ML estimate \(\widehat{\text{odds}} = \dfrac{\hat\pi}{1 - \hat\pi}\).

To estimate the variance of \(\widehat{\text{odds}}\), we need to use the delta method.

First, find the derivative of \(\tau(\pi)\): by the quotient rule, \(\tau'(\pi) = \dfrac{1}{(1 - \pi)^2}\).

Plugging in, we have

\[ \begin{aligned} \widehat{\operatorname{Var}}(\widehat{\text{odds}}) &\approx \left(\dfrac{1}{(1 - \hat{\pi})^2}\right)^2 \cdot \dfrac{\hat{\pi}(1 - \hat{\pi})}{N}\\[6pt] &= \dfrac{1}{(1 - \hat{\pi})^4} \cdot \dfrac{\hat{\pi}(1 - \hat{\pi})}{N}\\[6pt] &= \dfrac{\hat{\pi}(1 - \hat{\pi})}{N(1 - \hat{\pi})^4}\\[6pt] &= \dfrac{\hat{\pi}}{N(1 - \hat{\pi})^3}. \end{aligned} \]

And the standard error is

\[ \widehat{\text{SE}}(\widehat{\text{odds}}) \;\approx\; \sqrt{\dfrac{\hat{\pi}}{N(1 - \hat{\pi})^3}}. \]

Assuming a data set with 150 trials and 8 successes, we have the following estimates.

# number of trials and successes
n <- 150
y <- 8

# estimate pi
pi_hat <- y / n
pi_hat
[1] 0.05333333
# variance and SE estimate for pi_hat
var_pi_hat <- pi_hat * (1 - pi_hat) / n
se_pi_hat  <- sqrt(var_pi_hat)
se_pi_hat
[1] 0.01834646
# transform to estimate odds
odds_hat <- pi_hat / (1 - pi_hat)
odds_hat
[1] 0.05633803
# variance and SE for odds_hat (delta method)
var_odds_hat <- pi_hat / (n * (1 - pi_hat)^3)
se_odds_hat  <- sqrt(var_odds_hat)
se_odds_hat
[1] 0.0204719
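
From here, a 95% Wald confidence interval for the odds follows the usual formula. This is a quick sketch using the odds_hat and se_odds_hat objects just computed.

# 95% Wald confidence interval for the odds
c(odds_hat - 1.96 * se_odds_hat,
  odds_hat + 1.96 * se_odds_hat)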

7.2 Poisson: From \(\lambda\) to SD

Suppose a Poisson model of a count variable \(y\). The ML estimate is \(\hat{\lambda} = \text{avg}(y)\). We can use the information matrix to estimate the variance \(\widehat{\operatorname{Var}}(\hat{\lambda}) = \dfrac{\hat{\lambda}}{N}\).

Suppose we want to estimate the standard deviation of the Poisson distribution, which is \(\sigma = \sqrt{\lambda}\). Then \(\sigma = \tau(\lambda) = \sqrt{\lambda}\) and \(\tau'(\lambda) = \dfrac{1}{2\sqrt{\lambda}}\). Then, by the delta method, \(\widehat{\operatorname{Var}}(\hat{\sigma}) \approx \left(\dfrac{1}{2\sqrt{\hat{\lambda}}}\right)^2 \cdot \dfrac{\hat{\lambda}}{N}\).

Simplifying, we have \(\widehat{\operatorname{Var}}(\hat{\sigma}) \approx \dfrac{1}{4N}\) and \(\widehat{\text{SE}}(\hat{\sigma}) \;\approx\; \dfrac{1}{2\sqrt{N}}\).

Notice that this result does not depend on \(\hat{\lambda}\), which is perhaps unexpected. We can check our work with a quick simulation.

# fix sample size, vary lambda
n <- 100
lambda_vals <- c(2, 10, 50)

# store results
results <- data.frame(lambda = lambda_vals, se_sigma_mc = NA)

for (j in 1:length(lambda_vals)) {
  lambda_j <- lambda_vals[j]  # current value of lambda
  sigma_hats <- numeric(10000)   # container for 10,000 estimates
  
  for (i in 1:10000) {
    y <- rpois(n, lambda_j)
    lambda_hat <- mean(y)
    sigma_hats[i] <- sqrt(lambda_hat)
  }
  
  results$se_sigma_mc[j] <- sd(sigma_hats)
}

1/(2*sqrt(n))  # delta method SE
[1] 0.05
results  # monte carlo SE
  lambda se_sigma_mc
1      2  0.04981397
2     10  0.05012129
3     50  0.05006480
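
As a quick application of the formula itself, here is a sketch using a single simulated sample (the sample size and \(\lambda = 10\) are illustrative, since this section does not work with a real dataset): we estimate \(\sigma\) via the invariance property and attach its delta-method SE.

# single simulated sample (illustrative values only)
set.seed(42)
y_sim <- rpois(100, lambda = 10)

lambda_hat <- mean(y_sim)                    # ML estimate of lambda
sigma_hat  <- sqrt(lambda_hat)               # ML estimate of sigma (invariance)
se_sigma   <- 1 / (2 * sqrt(length(y_sim)))  # delta-method SE

c(sigma_hat = sigma_hat, se_sigma = se_sigma)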

7.3 Beta: From \((\alpha, \beta)\) to \(\mu\)

Suppose a beta model of a continuous variable \(y\) that lies strictly between zero and one. We use optim() to estimate the parameters and the covariance matrix of the estimates.

# log-likelihood function (using dbeta!)
beta_ll_fn <- function(theta, y) { 
  alpha <- theta[1] 
  beta  <- theta[2] 
  ll <- sum(dbeta(y, shape1 = alpha, shape2 = beta, log = TRUE))
  return(ll)
}

# function to fit beta model 
est_beta <- function(y) {
  # use optim; compute hessian
  est <- optim(
    par     = c(2, 2),  # decent starting values for the problem below
    fn      = beta_ll_fn,
    y       = y,
    control = list(fnscale = -1),  
    method  = "BFGS",
    hessian = TRUE            
  ) 
  
  # compute an estimate of covariance matrix (slowly, this first time)
  info_obs <- -est$hessian  # notice negative sign
  var_hat  <- solve(info_obs) 
  
  # check convergence; print warning if needed
  if (est$convergence != 0) print("Model did not converge!")
  
  # return list of elements
  res <- list(theta_hat = est$par, 
              var_hat   = var_hat) 
  return(res)
}

# load packages
library(tidyverse)
library(Lahman)  # data from Lahman's baseball database

# create data frame with batting average
bstats <- battingStats() |> 
  filter(yearID == 2023) |>  # data from 2023
  filter(AB >= 100) |>  # players with at least 100 at-bats
  select(player_id = playerID, batting_average = BA) |>
  arrange(-batting_average) |>
  na.omit() |>
  glimpse()
Rows: 457
Columns: 2
$ player_id       <chr> "arraelu01", "acunaro01", "freemfr01", "diazya01", "se…
$ batting_average <dbl> 0.354, 0.337, 0.331, 0.330, 0.327, 0.319, 0.316, 0.312…
# estimate beta model using the batting average data
fit <- est_beta(bstats$batting_average)
fit$theta_hat  # parameter estimates
[1]  37.07655 114.92550
fit$var_hat  # covariance matrix estimates
          [,1]     [,2]
[1,]  5.964783 18.40870
[2,] 18.408705 57.83667

But the \(\alpha\) and \(\beta\) parameters are not easy to interpret. Suppose we want the mean \(\mu = \dfrac{\alpha}{\alpha + \beta}\). We can use the invariance property.

# the formulas for the mean and variance of the beta distribution
#   as functions of alpha and beta get long, so I use a and b for
#   compactness rather than the more descriptive alpha_hat and
#   beta_hat.
a <- fit$theta_hat[1]  # alpha_hat
b <- fit$theta_hat[2]  # beta_hat

# mu_hat via invariance property
mu_hat <- a/(a + b)  
mu_hat
[1] 0.2439214

But we also might like to estimate the SE for \(\hat\mu\). We can do this with the delta method. Recall that the delta method is \(\widehat{\operatorname{Var}}[\tau(\hat{\theta})] \approx \nabla \tau(\hat{\theta})^\top \cdot \widehat{\operatorname{Var}}(\hat{\theta}) \cdot \nabla \tau(\hat{\theta})\). This has two parts: the covariance matrix \(\widehat{\operatorname{Var}}(\hat{\theta})\) and the gradient \(\nabla \tau(\hat{\theta})\).

7.3.1 The covariance matrix \(\widehat{\operatorname{Var}}(\hat{\theta})\)

In this case, \(\widehat{\operatorname{Var}}(\hat{\theta}) = \widehat{\operatorname{Var}}(\hat{\alpha}, \hat{\beta})\) is computed numerically by our est_beta() function, so this part is ready to go.

fit$var_hat
          [,1]     [,2]
[1,]  5.964783 18.40870
[2,] 18.408705 57.83667

7.3.2 The gradient \(\nabla \tau(\hat{\theta})\)

The gradient of \(\mu = \tau(\alpha, \beta) = \frac{\alpha}{\alpha + \beta}\) is:

\[ \nabla \tau(\alpha,\beta) = \begin{bmatrix} \dfrac{\partial \tau}{\partial \alpha}\\[6pt] \dfrac{\partial \tau}{\partial \beta} \end{bmatrix} = \begin{bmatrix} \dfrac{\beta}{(\alpha+\beta)^2}\\[6pt] -\dfrac{\alpha}{(\alpha+\beta)^2} \end{bmatrix} \]

Plugging in \(\hat\alpha\) and \(\hat\beta\), we have

\[ \nabla \tau(\hat\alpha, \hat\beta) =\begin{bmatrix}\dfrac{\hat\beta}{(\hat\alpha+\hat\beta)^2}\\-\dfrac{\hat\alpha}{(\hat\alpha+\hat\beta)^2}\end{bmatrix} = \begin{bmatrix}\dfrac{114.93}{(37.08+114.93)^2}\\[6pt]-\dfrac{37.08}{(37.08+114.93)^2}\end{bmatrix} = \begin{bmatrix} 0.0050 \\ -0.0016\end{bmatrix} \]

7.3.3 The matrix algebra

7.3.3.1 Using R

This is a simple calculation with R.

# create gradient using a and b from above
grad <- c(b/(a + b)^2, 
          -a/(a + b)^2)
var_hat_mu <- grad %*% fit$var_hat %*% grad
var_hat_mu
             [,1]
[1,] 2.637491e-06
se_hat_mu <- sqrt(var_hat_mu)
se_hat_mu
            [,1]
[1,] 0.001624035

Note: R treats the two-element numeric vector as a 2x1 matrix or 1x2 matrix as necessary to make the matrix multiplication conformable. This avoids the need to explicitly transpose. We could also make grad a 2x1 column matrix and explicitly transpose if we wanted.

grad <- matrix(c(b/(a + b)^2, -a/(a + b)^2), 
               nrow = 2)
grad
             [,1]
[1,]  0.004974134
[2,] -0.001604724
var_hat_mu <- t(grad) %*% fit$var_hat %*% grad
var_hat_mu
             [,1]
[1,] 2.637491e-06

7.3.3.2 “By hand”

To remind us what R is doing with t(grad) %*% fit$var_hat %*% grad, here is what the matrix multiplication looks like “by hand.” We carry a few more decimal places than in the displays above because the two terms in the first entry of the product nearly cancel, so heavier rounding would distort the result.

\[ \begin{aligned} \widehat{\mathrm{Var}}(\widehat{\mu}) &\approx \overbrace{\begin{bmatrix} 0.004974 & -0.001605 \end{bmatrix}}^{\nabla\tau(\hat\alpha,\hat\beta)^{\!\top}} \Biggl( \underbrace{\begin{bmatrix} 5.9648 & 18.4087\\[2pt] 18.4087 & 57.8367 \end{bmatrix}}_{\widehat{\mathrm{Var}}(\hat\alpha,\hat\beta)} \underbrace{\begin{bmatrix} 0.004974\\[2pt] -0.001605 \end{bmatrix}}_{\nabla\tau(\hat\alpha,\hat\beta)} \Biggr) && \text{plug in values} \\[10pt] &= \overbrace{\begin{bmatrix} 0.004974 & -0.001605 \end{bmatrix}}^{\nabla\tau^{\!\top}} \begin{bmatrix} 5.9648(0.004974) + 18.4087(-0.001605)\\[2pt] 18.4087(0.004974) + 57.8367(-0.001605) \end{bmatrix} && \text{multiply RHS; (2 x 2) x (2 x 1)} \\[10pt] &= \overbrace{\begin{bmatrix} 0.004974 & -0.001605 \end{bmatrix}}^{\nabla\tau^{\!\top}} \begin{bmatrix} 0.00012\\[2pt] -0.00126 \end{bmatrix} && \text{simplify} \\[10pt] &= 0.004974\cdot 0.00012 \;+\; (-0.001605)\cdot(-0.00126) && \text{multiply; (1 x 2) x (2 x 1)} \\[6pt] &\approx 0.0000006 \;+\; 0.0000020 && \text{simplify} \\[6pt] &\approx 0.0000026 && \text{simplify} \end{aligned} \]

We can compare this to the classical estimates of the mean and its SE. The two approaches give very similar, but not identical, results.

# classical mean
mean(bstats$batting_average)
[1] 0.2439672
# classical SE
sd(bstats$batting_average)/sqrt(length(bstats$batting_average))
[1] 0.001589904
| Model | How mean and SE are estimated | Mean | SE |
|-------|-------------------------------|------|----|
| Classical | $\hat\mu = \operatorname{avg}(y)$; $\widehat{\text{SE}} = \frac{\operatorname{SD}(y)}{\sqrt{N}}$. | 0.24397 | 0.001590 |
| Beta model | Estimate $\hat\alpha$ and $\hat\beta$ with ML; estimate their variance with the information matrix; estimate $\hat\mu$ with the invariance property; estimate the SE with the delta method. | 0.24392 | 0.001624 |

7.3.4 Numerical gradient

For cases where the gradient of \(\tau(\theta)\) is complex, we can compute a numerical gradient. Again, the algorithm finds the gradient by nudging \(\alpha\) and \(\beta\) and checking the change in \(\tau\).

library(numDeriv)   # for numerical gradients

# create the function tau
tau_fn <- function(theta) {  
  a <- theta[1]
  b <- theta[2]
  a / (a + b)
}

# compute the gradient of tau
grad <- grad(func = tau_fn, x = fit$theta_hat)

# delta method
var_hat_mu <- grad %*% fit$var_hat %*% grad  # R transposes grad as needed
var_hat_mu
             [,1]
[1,] 2.637491e-06
# sqrt of variance to find SE
se_hat_mu <- sqrt(var_hat_mu) 
se_hat_mu
            [,1]
[1,] 0.001624035