7  Delta Method

The invariance property tells us that if \(\hat{\theta}\) is the ML estimate of \(\theta\), then \(\hat\tau = \tau(\hat{\theta})\) is the ML estimate of \(\tau(\theta)\). But this raises a new question: how do we obtain the standard error of \(\hat\tau\)? We have an estimated SE for \(\hat{\theta}\); how do we convert it into an estimated SE for \(\hat\tau = \tau(\hat{\theta})\)?

Key Idea: The delta method uses a Taylor expansion to approximate the variance of \(\hat\tau = \tau(\hat{\theta})\) using the variance of \(\hat{\theta}\).

A Taylor expansion uses derivatives to approximate a function near a particular point. For a one-variable function \(\tau(\theta)\) expanded around \(\hat{\theta}\),

\[ \small \tau(\theta) \;\approx\; \tau(\hat{\theta}) + \tau'(\hat{\theta})(\theta - \hat{\theta}) + \tfrac{1}{2}\tau''(\hat{\theta})(\theta - \hat{\theta})^2 + \cdots \]

When \(\theta\) is close to \(\hat{\theta}\), the higher-order terms are small. Dropping them gives the linear approximation \[ \small \tau(\theta) \;\approx\; \tau(\hat{\theta}) + \tau'(\hat{\theta})(\theta - \hat{\theta}). \]

Thus, near \(\hat{\theta}\), small deviations are scaled by the slope \(\tau'(\hat{\theta})\). Applying this scaling to the sampling variability of \(\hat{\theta}\) gives the delta method \[ \small \operatorname{Var}\!\big(\tau(\hat{\theta})\big) \;\approx\; \big(\tau'(\hat{\theta})\big)^2 \, \operatorname{Var}(\hat{\theta}). \] The square comes from the rule \[ \small \operatorname{Var}(cX) = c^2 \operatorname{Var}(X). \]

Intuitively, you can think of a Taylor expansion as “zooming in” until the curve looks straight. Near \(\theta\), small wiggles in \(\hat{\theta}\) are expanded or shrunk by \(\tau(\cdot)\) depending on the slope. The delta method adjusts for this shrinking or expanding.

In the one-parameter case, \(\widehat{\operatorname{Var}}[\tau(\hat{\theta})] \approx \Big(\tau'(\hat{\theta})\Big)^2 \cdot \widehat{\operatorname{Var}}(\hat{\theta})\).
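
To see this scaling in action, here is a quick simulation sketch. The transformation \(\tau(\theta) = e^\theta\), the "true" value 1.5, and the SE of 0.05 are illustrative assumptions, not tied to any model in this chapter: we draw many estimates around the true \(\theta\), transform them, and compare the simulated variance of \(\tau(\hat{\theta})\) to the delta-method approximation.

# simulation sketch of the one-parameter delta method
# (tau(theta) = exp(theta) and the values below are illustrative choices)
set.seed(1234)
theta    <- 1.5    # "true" parameter value (assumed)
se_theta <- 0.05   # SD of the sampling distribution of theta_hat (assumed)

theta_hat <- rnorm(100000, mean = theta, sd = se_theta)  # simulated estimates
tau_hat   <- exp(theta_hat)                              # transformed estimates

var(tau_hat)               # simulated variance of tau(theta_hat)
exp(theta)^2 * se_theta^2  # delta method: (tau'(theta))^2 * Var(theta_hat)

The two numbers should be close, and the approximation improves as the SE of \(\hat{\theta}\) shrinks, because the linear approximation only needs to hold over a narrower range.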

In the multi-parameter case, if \(\hat{\theta}\) is a vector of estimates and \(\tau(\cdot)\) is a scalar-valued function of those parameters, then \[ \widehat{\operatorname{Var}}(\hat\tau) = \widehat{\operatorname{Var}}[\tau(\hat{\theta})] \approx \nabla \tau(\hat{\theta})^\top \cdot \widehat{\operatorname{Var}}(\hat{\theta}) \cdot \nabla \tau(\hat{\theta}), \] where \(\nabla \tau(\hat{\theta})\) is the gradient of \(\tau(\theta)\) with respect to \(\theta\), evaluated at \(\hat\theta\).

From there, we can create Wald confidence intervals in the usual way. Recall that \(\widehat{\text{SE}}(\hat{\tau}) = \sqrt{\widehat{\operatorname{Var}}(\hat\tau)}\).

\[ \begin{align*} 90\%~\text{C.I.} &= \hat{\tau} \pm 1.64 \cdot \widehat{\text{SE}}(\hat{\tau})\\ 95\%~\text{C.I.} &= \hat{\tau} \pm 1.96 \cdot \widehat{\text{SE}}(\hat{\tau}) \end{align*} \]

7.1 Bernoulli: From \(\pi\) to odds

Suppose a Bernoulli model of a binary outcome \(y\). The ML estimate of \(\pi\) is \(\hat{\pi} = \text{avg}(y)\). Using the Fisher information, the variance of \(\hat{\pi}\) is \(\widehat{\operatorname{Var}}(\hat{\pi}) = \dfrac{\hat{\pi}(1 - \hat{\pi})}{N}\).

Suppose we want to transform \(\pi\) to the odds, where \(\text{odds} = \tau(\pi) = \dfrac{\pi}{1 - \pi}\). We use the invariance property to obtain the ML estimate \(\widehat{\text{odds}} = \dfrac{\hat\pi}{1 - \hat\pi}\).

To estimate the variance of \(\widehat{\text{odds}}\), we need to use the delta method.

First, find the derivative of \(\tau(\pi)\): by the quotient rule, \(\tau'(\pi) = \dfrac{1}{(1 - \pi)^2}\).

Plugging in, we have

\[ \begin{aligned} \widehat{\operatorname{Var}}(\widehat{\text{odds}}) &\approx \left(\dfrac{1}{(1 - \hat{\pi})^2}\right)^2 \cdot \dfrac{\hat{\pi}(1 - \hat{\pi})}{N}\\[6pt] &= \dfrac{1}{(1 - \hat{\pi})^4} \cdot \dfrac{\hat{\pi}(1 - \hat{\pi})}{N}\\[6pt] &= \dfrac{\hat{\pi}(1 - \hat{\pi})}{N(1 - \hat{\pi})^4}\\[6pt] &= \dfrac{\hat{\pi}}{N(1 - \hat{\pi})^3}. \end{aligned} \]

And the standard error is

\[ \widehat{\text{SE}}(\widehat{\text{odds}}) \;\approx\; \sqrt{\dfrac{\hat{\pi}}{N(1 - \hat{\pi})^3}}. \]

Assuming a data set with 150 trials and 8 successes, we have the following estimates.

# number of trials and successes
n <- 150
y <- 8

# estimate pi
pi_hat <- y / n
pi_hat
[1] 0.05333333
# variance and SE estimate for pi_hat
var_pi_hat <- pi_hat * (1 - pi_hat) / n
se_pi_hat  <- sqrt(var_pi_hat)
se_pi_hat
[1] 0.01834646
# transform to estimate odds
odds_hat <- pi_hat / (1 - pi_hat)
odds_hat
[1] 0.05633803
# variance and SE for odds_hat (delta method)
var_odds_hat <- pi_hat / (n * (1 - pi_hat)^3)
se_odds_hat  <- sqrt(var_odds_hat)
se_odds_hat
[1] 0.0204719
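
From here, a 95% Wald confidence interval for the odds follows the usual formula. This is a quick sketch using the odds_hat and se_odds_hat objects just computed.

# 95% Wald confidence interval for the odds
c(odds_hat - 1.96 * se_odds_hat,
  odds_hat + 1.96 * se_odds_hat)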

7.2 Poisson: From \(\lambda\) to SD

Suppose a Poisson model of a count variable \(y\). The ML estimate is \(\hat{\lambda} = \text{avg}(y)\). We can use the information matrix to estimate the variance \(\widehat{\operatorname{Var}}(\hat{\lambda}) = \dfrac{\hat{\lambda}}{N}\).

Suppose we want to estimate the standard deviation of the Poisson distribution, which is \(\sigma = \sqrt{\lambda}\). Then \(\sigma = \tau(\lambda) = \sqrt{\lambda}\) and \(\tau'(\lambda) = \dfrac{1}{2\sqrt{\lambda}}\). Then, by the delta method, \(\widehat{\operatorname{Var}}(\hat{\sigma}) \approx \left(\dfrac{1}{2\sqrt{\hat{\lambda}}}\right)^2 \cdot \dfrac{\hat{\lambda}}{N}\).

Simplifying, we have \(\widehat{\operatorname{Var}}(\hat{\sigma}) \approx \dfrac{1}{4N}\) and \(\widehat{\text{SE}}(\hat{\sigma}) \;\approx\; \dfrac{1}{2\sqrt{N}}\).

Notice that this result does not depend on \(\hat{\lambda}\), which is perhaps unexpected. We can check our work with a quick simulation.

# fix sample size, vary lambda
n <- 100
lambda_vals <- c(2, 10, 50)

# store results
results <- data.frame(lambda = lambda_vals, se_sigma_mc = NA)

for (j in 1:length(lambda_vals)) {
  lambda_j <- lambda_vals[j]  # current value of lambda
  sigma_hats <- numeric(10000)   # container for 10,000 estimates
  
  for (i in 1:10000) {
    y <- rpois(n, lambda_j)
    lambda_hat <- mean(y)
    sigma_hats[i] <- sqrt(lambda_hat)
  }
  
  results$se_sigma_mc[j] <- sd(sigma_hats)
}

1/(2*sqrt(n))  # delta method SE
[1] 0.05
results  # monte carlo SE
  lambda se_sigma_mc
1      2  0.04981397
2     10  0.05012129
3     50  0.05006480
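
As a quick application of the formula itself, here is a sketch using a single simulated sample (the sample size and \(\lambda = 10\) are illustrative, since this section does not work with a real dataset): we estimate \(\sigma\) via the invariance property and attach its delta-method SE.

# single simulated sample (illustrative values only)
set.seed(42)
y_sim <- rpois(100, lambda = 10)

lambda_hat <- mean(y_sim)                    # ML estimate of lambda
sigma_hat  <- sqrt(lambda_hat)               # ML estimate of sigma (invariance)
se_sigma   <- 1 / (2 * sqrt(length(y_sim)))  # delta-method SE

c(sigma_hat = sigma_hat, se_sigma = se_sigma)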

7.3 Beta: From \((\alpha, \beta)\) to \(\mu\)

Suppose a beta model of a continuous variable \(y\) that lies strictly between zero and one. We use optim() to estimate the parameters and the covariance matrix of the estimates.

# log-likelihood function (using dbeta!)
beta_ll_fn <- function(theta, y) { 
  alpha <- theta[1] 
  beta  <- theta[2] 
  ll <- sum(dbeta(y, shape1 = alpha, shape2 = beta, log = TRUE))
  return(ll)
}

# function to fit beta model 
est_beta <- function(y) {
  # use optim; compute hessian
  est <- optim(
    par     = c(2, 2),  # decent starting values for the problem below
    fn      = beta_ll_fn,
    y       = y,
    control = list(fnscale = -1),  
    method  = "BFGS",
    hessian = TRUE            
  ) 
  
  # compute an estimate of covariance matrix (slowly, this first time)
  info_obs <- -est$hessian  # notice negative sign
  var_hat  <- solve(info_obs) 
  
  # check convergence; print warning if needed
  if (est$convergence != 0) print("Model did not converge!")
  
  # return list of elements
  res <- list(theta_hat = est$par, 
              var_hat   = var_hat) 
  return(res)
}

# load packages
library(tidyverse)
library(Lahman)  # data from Lahman's baseball database

# create data frame with batting average
bstats <- battingStats() |> 
  filter(yearID == 2023) |>  # data from 2023
  filter(AB >= 100) |>  # players with at least 100 at-bats
  select(player_id = playerID, batting_average = BA) |>
  arrange(-batting_average) |>
  na.omit() |>
  glimpse()
Rows: 457
Columns: 2
$ player_id       <chr> "arraelu01", "acunaro01", "freemfr01", "diazya01", "se…
$ batting_average <dbl> 0.354, 0.337, 0.331, 0.330, 0.327, 0.319, 0.316, 0.312…
# estimate beta model using the batting average data
fit <- est_beta(bstats$batting_average)
fit$theta_hat  # parameter estimates
[1]  37.07655 114.92550
fit$var_hat  # covariance matrix estimates
          [,1]     [,2]
[1,]  5.964783 18.40870
[2,] 18.408705 57.83667

But the \(\alpha\) and \(\beta\) parameters are not easy to interpret. Suppose we want the mean \(\mu = \dfrac{\alpha}{\alpha + \beta}\). We can use the invariance property.

# the formulas for the mean and variance of the beta distribution
#   as functions of alpha and beta get long, so I use a and b for
#   compactness rather than the more descriptive alpha_hat and
#   beta_hat.
a <- fit$theta_hat[1]  # alpha_hat
b <- fit$theta_hat[2]  # beta_hat

# mu_hat via invariance property
mu_hat <- a/(a + b)  
mu_hat
[1] 0.2439214

But we also might like to estimate the SE for \(\hat\mu\). We can do this with the delta method. Recall that the delta method is \(\widehat{\operatorname{Var}}[\tau(\hat{\theta})] \approx \nabla \tau(\hat{\theta})^\top \cdot \widehat{\operatorname{Var}}(\hat{\theta}) \cdot \nabla \tau(\hat{\theta})\). This has two parts: the covariance matrix \(\widehat{\operatorname{Var}}(\hat{\theta})\) and the gradient \(\nabla \tau(\hat{\theta})\).

7.3.1 The covariance matrix \(\widehat{\operatorname{Var}}(\hat{\theta})\)

In this case, \(\widehat{\operatorname{Var}}(\hat{\theta}) = \widehat{\operatorname{Var}}(\hat{\alpha}, \hat{\beta})\) is computed numerically by our est_beta() function, so this part is ready to go.

fit$var_hat
          [,1]     [,2]
[1,]  5.964783 18.40870
[2,] 18.408705 57.83667

7.3.2 The gradient \(\nabla \tau(\hat{\theta})\)

The gradient of \(\mu = \tau(\alpha, \beta) = \frac{\alpha}{\alpha + \beta}\) is:

\[ \nabla \tau(\alpha,\beta) = \begin{bmatrix} \dfrac{\partial \tau}{\partial \alpha}\\[6pt] \dfrac{\partial \tau}{\partial \beta} \end{bmatrix} = \begin{bmatrix} \dfrac{\beta}{(\alpha+\beta)^2}\\[6pt] -\dfrac{\alpha}{(\alpha+\beta)^2} \end{bmatrix} \]

Plugging in \(\hat\alpha\) and \(\hat\beta\), we have

\[ \nabla \tau(\hat\alpha, \hat\beta) =\begin{bmatrix}\dfrac{\hat\beta}{(\hat\alpha+\hat\beta)^2}\\-\dfrac{\hat\alpha}{(\hat\alpha+\hat\beta)^2}\end{bmatrix} = \begin{bmatrix}\dfrac{114.93}{(37.08+114.93)^2}\\[6pt]-\dfrac{37.08}{(37.08+114.93)^2}\end{bmatrix} = \begin{bmatrix} 0.0050 \\ -0.0016\end{bmatrix} \]

7.3.3 The matrix algebra

7.3.3.1 Using R

This is a simple calculation with R.

# create gradient using a and b from above
grad <- c(b/(a + b)^2, 
          -a/(a + b)^2)
var_hat_mu <- grad %*% fit$var_hat %*% grad
var_hat_mu
             [,1]
[1,] 2.637491e-06
se_hat_mu <- sqrt(var_hat_mu)
se_hat_mu
            [,1]
[1,] 0.001624035

Note: R treats the two-element numeric vector as a 2x1 matrix or 1x2 matrix as necessary to make the matrix multiplication conformable. This avoids the need to explicitly transpose. We could also make grad a 2x1 column matrix and explicitly transpose if we wanted.

grad <- matrix(c(b/(a + b)^2, -a/(a + b)^2), 
               nrow = 2)
grad
             [,1]
[1,]  0.004974134
[2,] -0.001604724
var_hat_mu <- t(grad) %*% fit$var_hat %*% grad
var_hat_mu
             [,1]
[1,] 2.637491e-06

7.3.3.2 “By hand”

To remind us what R is doing with t(grad) %*% fit$var_hat %*% grad, here is what the matrix multiplication looks like “by hand.” We carry a few more decimal places than in the displays above because the two terms in the first entry of the product nearly cancel, so heavier rounding would distort the result.

\[ \begin{aligned} \widehat{\mathrm{Var}}(\widehat{\mu}) &\approx \overbrace{\begin{bmatrix} 0.004974 & -0.001605 \end{bmatrix}}^{\nabla\tau(\hat\alpha,\hat\beta)^{\!\top}} \Biggl( \underbrace{\begin{bmatrix} 5.9648 & 18.4087\\[2pt] 18.4087 & 57.8367 \end{bmatrix}}_{\widehat{\mathrm{Var}}(\hat\alpha,\hat\beta)} \underbrace{\begin{bmatrix} 0.004974\\[2pt] -0.001605 \end{bmatrix}}_{\nabla\tau(\hat\alpha,\hat\beta)} \Biggr) && \text{plug in values} \\[10pt] &= \overbrace{\begin{bmatrix} 0.004974 & -0.001605 \end{bmatrix}}^{\nabla\tau^{\!\top}} \begin{bmatrix} 5.9648(0.004974) + 18.4087(-0.001605)\\[2pt] 18.4087(0.004974) + 57.8367(-0.001605) \end{bmatrix} && \text{multiply RHS; (2 x 2) x (2 x 1)} \\[10pt] &= \overbrace{\begin{bmatrix} 0.004974 & -0.001605 \end{bmatrix}}^{\nabla\tau^{\!\top}} \begin{bmatrix} 0.00012\\[2pt] -0.00126 \end{bmatrix} && \text{simplify} \\[10pt] &= 0.004974\cdot 0.00012 \;+\; (-0.001605)\cdot(-0.00126) && \text{multiply; (1 x 2) x (2 x 1)} \\[6pt] &\approx 0.0000006 \;+\; 0.0000020 && \text{simplify} \\[6pt] &\approx 0.0000026 && \text{simplify} \end{aligned} \]

We can compare this to the classical estimates of the mean and its SE. The two approaches give very similar, but not identical, results.

# classical mean
mean(bstats$batting_average)
[1] 0.2439672
# classical SE
sd(bstats$batting_average)/sqrt(length(bstats$batting_average))
[1] 0.001589904
| Model | How mean and SE are estimated | Mean | SE |
|-------|-------------------------------|------|----|
| Classical | $\hat\mu = \operatorname{avg}(y)$; $\widehat{\text{SE}} = \frac{\operatorname{SD}(y)}{\sqrt{N}}$. | 0.24397 | 0.001590 |
| Beta model | Estimate $\hat\alpha$ and $\hat\beta$ with ML; estimate their variance with the information matrix; estimate $\hat\mu$ with the invariance property; estimate the SE with the delta method. | 0.24392 | 0.001624 |

7.3.4 Numerical gradient

For cases where the gradient of \(\tau(\theta)\) is complex, we can compute a numerical gradient. Again, the algorithm finds the gradient by nudging \(\alpha\) and \(\beta\) and checking the change in \(\tau\).

library(numDeriv)   # for numerical gradients

# create the function tau
tau_fn <- function(theta) {  
  a <- theta[1]
  b <- theta[2]
  a / (a + b)
}

# compute the gradient of tau
grad <- grad(func = tau_fn, x = fit$theta_hat)

# delta method
var_hat_mu <- grad %*% fit$var_hat %*% grad  # R transposes grad as needed
var_hat_mu
             [,1]
[1,] 2.637491e-06
# sqrt of variance to find SE
se_hat_mu <- sqrt(var_hat_mu) 
se_hat_mu
            [,1]
[1,] 0.001624035