The invariance property tells us that if \(\hat{\theta}\) is the ML estimate of \(\theta\), then \(\hat\tau = \tau(\hat{\theta})\) is the ML estimate of \(\tau(\theta)\). But this raises a new question: we have an estimated SE for \(\hat{\theta}\), but how do we convert it into an estimated SE for \(\hat\tau = \tau(\hat{\theta})\)?
Key Idea: The delta method uses a Taylor expansion to approximate the variance of \(\hat\tau = \tau(\hat{\theta})\) using the variance of \(\hat{\theta}\).
A Taylor expansion uses derivatives to approximate a function near a particular point. For a one-variable function \(\tau(\theta)\) expanded around \(\hat{\theta}\), \[
\small
\tau(\theta) \;=\; \tau(\hat{\theta}) + \tau'(\hat{\theta})(\theta - \hat{\theta}) + \frac{\tau''(\hat{\theta})}{2!}(\theta - \hat{\theta})^2 + \cdots.
\]
When \(\theta\) is close to \(\hat{\theta}\), the higher-order terms are small. Dropping them gives the linear approximation \[
\small
\tau(\theta) \;\approx\; \tau(\hat{\theta}) + \tau'(\hat{\theta})(\theta - \hat{\theta}).
\]
Thus, small deviations of \(\theta\) from \(\hat{\theta}\) are scaled by the slope \(\tau'(\hat{\theta})\). The delta method captures this scaling \[
\small \operatorname{Var}\!\big(\tau(\hat{\theta})\big) \;\approx\; \big(\tau'(\hat{\theta})\big)^2 \, \operatorname{Var}(\hat{\theta}).
\] The square comes from the rule \[
\small
\operatorname{Var}(cX) = c^2 \operatorname{Var}(X).
\]
Intuitively, you can think of a Taylor expansion as “zooming in” on the curve until it looks straight. Near \(\hat{\theta}\), small wiggles in \(\hat{\theta}\) are expanded or shrunk by \(\tau(\cdot)\) depending on the slope. The delta method adjusts for this shrinking or expanding.
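To see this scaling in action, here is a minimal simulation sketch; the transformation \(\tau(\theta) = \theta^2\) and the numbers are made up for illustration.

# simulate small wiggles around theta_hat and compare the simulated
# variance of tau(theta) to the delta method approximation
set.seed(1234)
theta_hat <- 3
wiggles <- rnorm(100000, mean = theta_hat, sd = 0.05)

var(wiggles^2)              # simulated variance of tau(theta) = theta^2
(2 * theta_hat)^2 * 0.05^2  # delta method: (tau'(theta_hat))^2 * Var(theta)

The two values agree closely because the wiggles are small enough that \(\tau(\cdot)\) is nearly linear over their range.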
In the one-parameter case, \(\widehat{\operatorname{Var}}[\tau(\hat{\theta})] \approx \Big(\tau'(\hat{\theta})\Big)^2 \cdot \widehat{\operatorname{Var}}(\hat{\theta})\).
In the multi-parameter case, if \(\hat{\theta}\) is a vector of estimates and \(\tau(\cdot)\) is a scalar-valued function of the parameters, then \[
\widehat{\operatorname{Var}}(\hat\tau) = \widehat{\operatorname{Var}}[\tau(\hat{\theta})] \approx \nabla \tau(\hat{\theta})^\top \cdot \widehat{\operatorname{Var}}(\hat{\theta}) \cdot \nabla \tau(\hat{\theta}),
\] where \(\nabla \tau(\hat{\theta})\) is the gradient of \(\tau(\theta)\) with respect to \(\theta\) evaluated at \(\hat\theta\).
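In R, this quadratic form is a single line of matrix algebra. Here is a minimal sketch with made-up numbers for a hypothetical two-parameter case.

# hypothetical covariance matrix of theta_hat and gradient of tau
V <- matrix(c(0.04, 0.01,
              0.01, 0.09), nrow = 2)  # Var-hat(theta_hat)
g <- c(0.5, -0.2)                     # gradient of tau at theta_hat

# delta method: g' V g
var_tau_hat <- t(g) %*% V %*% g
var_tau_hat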
From there, we can create Wald confidence intervals in the usual way: \(\hat{\tau} \pm z^{*}\,\widehat{\text{SE}}(\hat{\tau})\) (e.g., \(z^{*} \approx 1.96\) for a 95% interval), where \(\widehat{\text{SE}}(\hat{\tau}) = \sqrt{\widehat{\operatorname{Var}}(\hat\tau)}\).
Suppose a Bernoulli model of a binary outcome \(y\). The ML estimate of \(\pi\) is \(\hat{\pi} = \text{avg}(y)\). Using the Fisher information, the variance of \(\hat{\pi}\) is \(\widehat{\operatorname{Var}}(\hat{\pi}) = \dfrac{\hat{\pi}(1 - \hat{\pi})}{N}\).
Suppose we want to transform \(\pi\) to the odds, where \(\text{odds} = \tau(\pi) = \dfrac{\pi}{1 - \pi}\). We use the invariance property to obtain the ML estimate \(\widehat{\text{odds}} = \dfrac{\hat\pi}{1 - \hat\pi}\).
To estimate the variance of \(\widehat{\text{odds}}\), we need to use the delta method.
First, find the derivative of \(\tau(\pi)\): \(\tau'(\pi) = \dfrac{1}{(1 - \pi)^2}\). Then, by the delta method, \(\widehat{\operatorname{Var}}(\widehat{\text{odds}}) \approx \left(\dfrac{1}{(1 - \hat\pi)^2}\right)^2 \cdot \dfrac{\hat{\pi}(1 - \hat{\pi})}{N} = \dfrac{\hat{\pi}}{N(1 - \hat{\pi})^3}\).
Assuming a data set with 150 trials and 8 successes, we have the following estimates.
# number of trials and successes
n <- 150
y <- 8

# estimate pi
pi_hat <- y / n
pi_hat
[1] 0.05333333
# variance and SE estimate for pi_hat
var_pi_hat <- pi_hat * (1 - pi_hat) / n
se_pi_hat <- sqrt(var_pi_hat)
se_pi_hat
[1] 0.01834646
# transform to estimate odds
odds_hat <- pi_hat / (1 - pi_hat)
odds_hat
[1] 0.05633803
# variance and SE for odds_hat (delta method)
var_odds_hat <- pi_hat / (n * (1 - pi_hat)^3)
se_odds_hat <- sqrt(var_odds_hat)
se_odds_hat
[1] 0.0204719
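Following the Wald recipe from above, a 95% confidence interval for the odds takes one more line (a quick sketch using the estimates already computed):

# 95% Wald confidence interval for the odds
odds_hat + c(-1, 1) * qnorm(0.975) * se_odds_hat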
7.2 Poisson: From \(\lambda\) to SD
Suppose a Poisson model of a count variable \(y\). The ML estimate is \(\hat{\lambda} = \text{avg}(y)\). We can use the information matrix to estimate the variance \(\widehat{\operatorname{Var}}(\hat{\lambda}) = \dfrac{\hat{\lambda}}{N}\).
Suppose we want to estimate the standard deviation of the Poisson distribution, which is \(\sigma = \sqrt{\lambda}\). Then \(\sigma = \tau(\lambda) = \sqrt{\lambda}\) and \(\tau'(\lambda) = \dfrac{1}{2\sqrt{\lambda}}\). Then, by the delta method, \(\widehat{\operatorname{Var}}(\hat{\sigma}) \approx \left(\dfrac{1}{2\sqrt{\hat{\lambda}}}\right)^2 \cdot \dfrac{\hat{\lambda}}{N}\).
Simplifying, we have \(\widehat{\operatorname{Var}}(\hat{\sigma}) \approx \dfrac{1}{4N}\) and \(\widehat{\text{SE}}(\hat{\sigma}) \;\approx\; \dfrac{1}{2\sqrt{N}}\).
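To make this concrete, here is a minimal sketch with hypothetical numbers (\(N = 200\) counts with \(\hat\lambda = 4.3\); both values are made up):

# hypothetical example: N = 200 counts with lambda_hat = 4.3
N <- 200
lambda_hat <- 4.3

# sigma_hat via invariance; SE via the delta method result above
sigma_hat <- sqrt(lambda_hat)
se_sigma_hat <- 1 / (2 * sqrt(N))
c(sigma_hat = sigma_hat, se = se_sigma_hat)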
Notice: This result does not depend on \(\hat{\lambda}\), which is perhaps unexpected. We can check our work with a quick simulation.
# fix sample size, vary lambda
n <- 100
lambda_vals <- c(2, 10, 50)

# store results
results <- data.frame(lambda = lambda_vals, se_sigma_mc = NA)
for (j in 1:length(lambda_vals)) {
  lambda_j <- lambda_vals[j]      # current value of lambda
  sigma_hats <- numeric(10000)    # container for 10,000 estimates
  for (i in 1:10000) {
    y <- rpois(n, lambda_j)
    lambda_hat <- mean(y)
    sigma_hats[i] <- sqrt(lambda_hat)
  }
  results$se_sigma_mc[j] <- sd(sigma_hats)
}
results           # Monte Carlo SEs for each lambda
1/(2*sqrt(n))     # delta method SE
7.3 Beta Example: From \((\alpha, \beta)\) to \(\mu\)
Suppose a beta model of a continuous variable \(y\) that lies strictly between zero and one. We use optim() to estimate the parameters and their covariance matrix, fitting the model to batting averages from Lahman's baseball database.
# log-likelihood function (using dbeta!)
beta_ll_fn <- function(theta, y) {
  alpha <- theta[1]
  beta <- theta[2]
  ll <- sum(dbeta(y, shape1 = alpha, shape2 = beta, log = TRUE))
  return(ll)
}

# function to fit beta model
est_beta <- function(y) {
  # use optim; compute hessian
  est <- optim(
    par = c(2, 2),  # decent starting values for the problem below
    fn = beta_ll_fn,
    y = y,
    control = list(fnscale = -1),
    method = "BFGS",
    hessian = TRUE
  )
  # compute an estimate of covariance matrix (slowly, this first time)
  info_obs <- -est$hessian  # notice negative sign
  var_hat <- solve(info_obs)
  # check convergence; print warning if needed
  if (est$convergence != 0) print("Model did not converge!")
  # return list of elements
  res <- list(theta_hat = est$par, var_hat = var_hat)
  return(res)
}

# load packages
library(tidyverse)
library(Lahman)  # data from Lahman's baseball database

# create data frame with batting average
bstats <- battingStats() |>
  filter(yearID == 2023) |>   # data from 2023
  filter(AB >= 100) |>        # players with at least 100 at-bats
  select(player_id = playerID, batting_average = BA) |>
  arrange(-batting_average) |>
  na.omit() |>
  glimpse()

# fit the beta model to the batting averages (so `fit` exists below)
fit <- est_beta(bstats$batting_average)
But the \(\alpha\) and \(\beta\) parameters are not easy to interpret. Suppose we want the mean \(\mu = \dfrac{\alpha}{\alpha + \beta}\). We can use the invariance property.
# while it's not as descriptive, the formulas for the mean
# and variance as a function of alpha and beta are quite
# long, so I use a and b for compactness rather than the
# more descriptive a_hat or alpha_hat variants.
a <- fit$theta_hat[1]  # alpha_hat
b <- fit$theta_hat[2]  # beta_hat

# mu_hat via invariance property
mu_hat <- a/(a + b)
mu_hat
[1] 0.2439214
But we also might like to estimate the SE for \(\hat\mu\). We can do this with the delta method. Recall that the delta method is \(\widehat{\operatorname{Var}}[\tau(\hat{\theta})] \approx \nabla \tau(\hat{\theta})^\top \cdot \widehat{\operatorname{Var}}(\hat{\theta}) \cdot \nabla \tau(\hat{\theta})\). This has two parts: the covariance matrix \(\widehat{\operatorname{Var}}(\hat{\theta})\) and the gradient \(\nabla \tau(\hat{\theta})\).
7.3.1 The covariance matrix \(\widehat{\operatorname{Var}}(\hat{\theta})\)
In this case, \(\widehat{\operatorname{Var}}(\hat{\theta}) = \widehat{\operatorname{Var}}(\hat{\alpha}, \hat{\beta})\) is computed numerically by our est_beta() function, so this part is ready to go.
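We can take a quick peek at it directly (the exact values depend on the fitted model):

# inspect the estimated covariance matrix of (alpha_hat, beta_hat)
fit$var_hat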
7.3.2 The gradient \(\nabla \tau(\hat{\theta})\)
The gradient of \(\tau(\alpha, \beta) = \dfrac{\alpha}{\alpha + \beta}\) has the two partial derivatives \(\dfrac{\partial \tau}{\partial \alpha} = \dfrac{\beta}{(\alpha + \beta)^2}\) and \(\dfrac{\partial \tau}{\partial \beta} = -\dfrac{\alpha}{(\alpha + \beta)^2}\), so we can compute it directly.
# create gradient using a and b from above
grad <- c(b/(a + b)^2, -a/(a + b)^2)

# delta method
var_hat_mu <- grad %*% fit$var_hat %*% grad
var_hat_mu
[,1]
[1,] 2.637491e-06
se_hat_mu <- sqrt(var_hat_mu)
se_hat_mu
[,1]
[1,] 0.001624035
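Following the Wald recipe from earlier, a minimal sketch of a 95% confidence interval for \(\mu\) (the as.numeric() call converts the 1x1 matrix to a scalar so the arithmetic recycles correctly):

# 95% Wald confidence interval for mu
se <- as.numeric(se_hat_mu)  # convert 1x1 matrix to a scalar
mu_hat + c(-1, 1) * qnorm(0.975) * se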
Note: R treats the two-element numeric vector as a 2x1 matrix or 1x2 matrix as necessary to make the matrix multiplication conformable. This avoids the need to explicitly transpose. We could also make grad a 2x1 column matrix and explicitly transpose if we wanted.
grad <- matrix(c(b/(a + b)^2, -a/(a + b)^2), nrow = 2)
grad

# delta method with an explicit transpose
var_hat_mu <- t(grad) %*% fit$var_hat %*% grad
var_hat_mu
7.3.3 Summary
To recap the workflow: estimate \(\hat\alpha\) and \(\hat\beta\) with ML; estimate the variance of \(\hat\alpha\) and \(\hat\beta\) with the information matrix; estimate \(\hat\mu\) with the invariance property; and estimate the SE with the delta method. For the batting average data, this gives \(\hat\mu = 0.24392\) with \(\widehat{\text{SE}}(\hat\mu) = 0.001624\).
7.3.4 Numerical gradient
For cases where the gradient of \(\tau(\theta)\) is complex, we can compute a numerical gradient instead. As with the numerical Hessian from optim(), the algorithm finds the gradient by nudging \(\alpha\) and \(\beta\) and checking the change in \(\tau\).
library(numDeriv)  # for numerical gradients

# create the function tau
tau_fn <- function(theta) {
  a <- theta[1]
  b <- theta[2]
  a / (a + b)
}

# compute the gradient of tau
grad <- grad(func = tau_fn, x = fit$theta_hat)

# delta method
var_hat_mu <- grad %*% fit$var_hat %*% grad  # R transposes grad as needed
var_hat_mu
[,1]
[1,] 2.637491e-06
# sqrt of variance to find SE
se_hat_mu <- sqrt(var_hat_mu)
se_hat_mu
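As a quick sanity check (a sketch, assuming the objects from above are still in memory), the numerical gradient should agree closely with the analytic one:

# compare the numerical gradient to the analytic gradient
grad_analytic <- c(b/(a + b)^2, -a/(a + b)^2)
grad - grad_analytic  # differences should be near zero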