18  Censoring

For some outcomes, certain observations are censored. For the uncensored observations, we know the value exactly, as usual. But for the censored observations, we know only the interval that the value lies within.

For example, we could imagine a sliding scale on a survey that allows respondents to report their exact income within the $10k to $150k interval, but whose endpoints read “less than $10k” and “greater than $150k.” For most respondents, we know their exact income. For some, though, we know only that it is less than $10k; for others, only that it is greater than $150k.

18.1 Adjusting the individual likelihood

In uncensored data, the likelihood contribution of an observation \(y_i\) is the pdf/pmf evaluated at the observed outcome \(L_i(\theta) = f(y_i; \theta)\).

For censored observations, \(y_i\) is unknown, so we replace \(f(y_i; \theta)\) with the probability of the observed region.

  • Right-censoring at \(c\) (we only know \(y_i>c\)) \[ L_i(\theta) = \Pr(y_i>c; \theta) = 1 - F(c; \theta). \]

  • Left-censoring at \(c\) (we only know \(y_i<c\)) \[ L_i(\theta) = \Pr(y_i<c; \theta) = F(c; \theta). \]

  • Interval censoring on \((a,b)\) (we only know \(a<y_i<b\)) \[ L_i(\theta) = \Pr(a<y_i<b; \theta) = F(b; \theta) - F(a; \theta). \]
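In R, these three contributions map directly onto the `d*()` and `p*()` functions. As a sketch, assuming (purely for illustration) a standard normal outcome:

```r
# likelihood contributions under a hypothetical N(0, 1) model
mu <- 0
sigma <- 1

dnorm(0.5, mu, sigma)                       # uncensored at y = 0.5: f(0.5)
pnorm(2, mu, sigma, lower.tail = FALSE)     # right-censored at c = 2: 1 - F(2)
pnorm(-1, mu, sigma)                        # left-censored at c = -1: F(-1)
pnorm(2, mu, sigma) - pnorm(-1, mu, sigma)  # interval-censored on (-1, 2): F(2) - F(-1)
```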

18.2 Combining into the full likelihood

The full likelihood multiplies pdfs/pmfs for fully observed cases and cumulative probabilities for censored cases.

Here is an example dataset that shows the observed data and their corresponding likelihood contributions.

\[ \begin{array}{c c c c c c c c c c c} y_1=2 & ; & y_2>3 & ; & y_3=1 & ; & y_4<1 & ; & y_5 \in (1,3) & ; & \cdots \\[6pt] f(2) & \times & \Pr(Y>3) & \times & f(1) & \times & \Pr(Y<1) & \times & \Pr(1<Y<3) & \times & \cdots \\[6pt] f(2) & \times & 1-F(3) & \times & f(1) & \times & F(1) & \times & F(3)-F(1) & \times & \cdots \end{array} \] This gives us the likelihood

\[ L(\theta) = f(2) \cdot [1-F(3)] \cdot f(1) \cdot F(1) \cdot [F(3)-F(1)] \cdots \]
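To make the product concrete, here is the same likelihood evaluated in R, assuming (purely for illustration) a standard normal model for the outcome:

```r
# pdf and cdf of a hypothetical N(0, 1) model
f  <- function(y) dnorm(y)
Fy <- function(y) pnorm(y)

# likelihood of the five example observations:
# y1 = 2, y2 > 3, y3 = 1, y4 < 1, y5 in (1, 3)
f(2) * (1 - Fy(3)) * f(1) * Fy(1) * (Fy(3) - Fy(1))
```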

It becomes more complicated to write the likelihood, because we need a way to flag which observations are censored and how they are censored. But the intuition remains the same.

Here are a couple of ways we might write this likelihood.

18.2.1 Option 1

For observations \(i \in \{1, 2, \dots, n\}\), let

  • \(U = \{i : \text{uncensored}\}\) with observed value \(y_i\)
  • \(R = \{i : \text{right-censored at } c_i\}\)
  • \(L = \{i : \text{left-censored at } c_i\}\)
  • \(Q = \{i : \text{interval-censored on } (a_i,b_i)\}\)

Then we can write the likelihood as

\[ L(\theta) = \overbrace{\prod_{i \in U} f(y_i \mid \theta)}^{\text{uncensored}}\; \cdot \overbrace{\prod_{i \in R} \{1 - F(c_i \mid \theta)\}}^{\text{right censored}}\; \cdot \overbrace{\prod_{i \in L} F(c_i \mid \theta)}^{\text{left censored}}\; \cdot \overbrace{\prod_{i \in Q} \{F(b_i \mid \theta) - F(a_i \mid \theta)\}}^{\text{interval censored}}. \]
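Here is a sketch of this four-set log-likelihood in R, for a normal outcome. The index sets, censoring points `cc`, and interval endpoints `a` and `b` below are hypothetical, chosen to match the five-observation example above:

```r
# log-likelihood in the four-set form, for a normal outcome with theta = (mu, sigma)
four_set_ll <- function(theta, y, cc, a, b, U, R, L, Q) {
  mu <- theta[1]
  sigma <- theta[2]
  sum(dnorm(y[U], mu, sigma, log = TRUE)) +                          # uncensored
    sum(pnorm(cc[R], mu, sigma, lower.tail = FALSE, log.p = TRUE)) + # right-censored
    sum(pnorm(cc[L], mu, sigma, log.p = TRUE)) +                     # left-censored
    sum(log(pnorm(b[Q], mu, sigma) - pnorm(a[Q], mu, sigma)))        # interval-censored
}

# toy data: y1 = 2, y2 > 3, y3 = 1, y4 < 1, y5 in (1, 3)
y  <- c(2, NA, 1, NA, NA)   # NA where the exact value is unobserved
cc <- c(NA, 3, NA, 1, NA)   # censoring points
a  <- c(NA, NA, NA, NA, 1)  # interval endpoints
b  <- c(NA, NA, NA, NA, 3)
four_set_ll(c(0, 1), y, cc, a, b, U = c(1, 3), R = 2, L = 4, Q = 5)
```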

18.2.2 Option 2

Equivalently, we can use the same trick used by the Bernoulli pmf, with indicator variables \(u_i, r_i, l_i, q_i \in \{0,1\}\),

\[ L(\theta) = \prod_{i=1}^n \Big[ f(y_i \mid \theta)^{u_i} \{1 - F(c_i \mid \theta)\}^{r_i} F(c_i \mid \theta)^{l_i} \{F(b_i \mid \theta) - F(a_i \mid \theta)\}^{q_i} \Big]. \]

18.3 Example

To see how censoring can bias estimates and how adjusting the likelihood can fix the bias, let’s simulate some fake data with censoring.

# set seed for reproducibility
set.seed(1)

# simulate latent outcome from usual normal linear model
n <- 100
x <- rexp(n, rate = 1/5)
X <- cbind(1, x) # intercept + one predictor
beta <- c(1, 0.5)
sigma <- 2
mu <- X %*% beta
y_star <- rnorm(n, mean = mu, sd = sigma)  # if fully observed

# censor data
# note: y = c if censored
c <- 5
d <- as.integer(y_star > c)  # 1 = censored, 0 = uncensored
y <- ifelse(d == 1, c, y_star)  # observed outcome (i.e., the censored outcome)

# make data frame
data <- data.frame(y, x, y_star, d)
head(data)
          y          x    y_star d
1 1.1262111  3.7759092 1.1262111 0
2 5.0000000  5.9082139 5.1466250 1
3 1.6037021  0.7285336 1.6037021 0
4 0.7851404  0.6989763 0.7851404 0
5 5.0000000  2.1803431 5.0021484 1
6 5.0000000 14.4748427 8.6954605 1
# verify the amount of censoring
mean(d)  # proportion of data that are censored
[1] 0.28

The plot below shows the structure of censored data. The blue points are censored. We do not know their values; we only know that they fall above the threshold (i.e., they are right-censored).

# load packages
library(ggplot2)
library(dplyr)
library(hrbrthemes)
library(showtext)

# download and register Source Sans 3 from google fonts
font_add_google("Source Sans 3", family = "Source Sans 3")
showtext_auto() 

# create factor
gg_data <- data |>
  mutate(d_lbl = factor(d, levels = c(0, 1),                   # original coding of d
      labels = c("Not Censored", "Censored")))

# make plot
ggplot(gg_data, aes(x = x, y = y_star, color = d_lbl)) +
  geom_hline(yintercept = c) + 
  annotate("label", x = 17.5, y = c, label = "Censoring Threshold", size = 3) + 
  # arrows from y_star to y, only for censored cases
  geom_segment(
    data = filter(gg_data, d == 1),
    aes(x = x, xend = x, y = y_star, yend = y),
    arrow = arrow(length = unit(0.15, "cm")),
    inherit.aes = FALSE,
    color = "grey50"
  ) +
  geom_point() +
  labs(title = "An Example Censored Dataset",
       subtitle = "Values Above 5 Are Right-Censored",
       x = "Hypothetical Explanatory Variable",
       y = "Latent Outcome Variable", 
       color = "Type") +
  theme_ipsum(base_family = "Source Sans 3") + 
  scale_color_manual(values = c("#e41a1c", "#377eb8", "#4daf4a"))

18.3.1 Ignore censoring

One approach would be to ignore the censoring. Remember that, by convention, the censored observations are recorded at the cutoff value. In the fake data, I chose c <- 5 for the cutoff. Ignoring the censoring, we estimate a slope of 0.23, give or take 0.03. This is much too low; the true slope is 0.5.

# fit ignoring censoring
fit_lm <- lm(y ~ x)
arm::display(fit_lm)
lm(formula = y ~ x)
            coef.est coef.se
(Intercept) 1.68     0.24   
x           0.23     0.03   
---
n = 100, k = 2
residual sd = 1.59, R-Squared = 0.32

18.3.2 Model censoring

Alternatively, we could model the censoring. After adjusting the likelihood function as described above, the usual recipe works as expected (ML to estimate parameters → Hessian for their variances → invariance property for quantities of interest → delta method for their variances).

# log-likelihood w/ censoring
normal_rightcens_ll <- function(theta, y, X, c, d) {
  # tidy up parameters
  k <- ncol(X)
  beta <- theta[1:k]
  sigma <- theta[k + 1]
  mu <- X %*% beta

  # uncensored
  ll_unc <- dnorm(y, mean = mu, sd = sigma, log = TRUE)

  # right-censored at c: log Pr(Y > c)
  ll_cens <- pnorm(c, mean = mu, sd = sigma, lower.tail = FALSE, log.p = TRUE)

  # sum the log-likelihood contributions
  ll <- sum((1 - d) * ll_unc + d * ll_cens)
  return(ll)
}

# optim
theta_start <- c(rep(0, ncol(X)), 2)
est <- optim(
  par     = theta_start,
  fn      = normal_rightcens_ll,
  y       = y,
  X       = X,
  c       = c,
  d       = d,
  method  = "BFGS",
  control = list(fnscale = -1),
  hessian = TRUE
)

# point estimates
est$par
[1] 0.8941892 0.5134936 1.8645425
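The point estimates are close to the truth of (1, 0.5, 2). The Hessian step of the recipe follows directly. Here is a sketch that recreates the fit above so it runs on its own and then converts the Hessian into standard errors; with `control = list(fnscale = -1)`, `est$hessian` is the Hessian of the log-likelihood, so the variance matrix is the inverse of its negative:

```r
# recreate the simulation and fit above so this snippet is self-contained
set.seed(1)
n <- 100
x <- rexp(n, rate = 1/5)
X <- cbind(1, x)
y_star <- rnorm(n, mean = X %*% c(1, 0.5), sd = 2)
d <- as.integer(y_star > 5)  # 1 = censored, 0 = uncensored
y <- ifelse(d == 1, 5, y_star)

# log-likelihood with right-censoring (same as above)
normal_rightcens_ll <- function(theta, y, X, c, d) {
  mu <- X %*% theta[1:ncol(X)]
  sigma <- theta[ncol(X) + 1]
  sum((1 - d) * dnorm(y, mean = mu, sd = sigma, log = TRUE) +
        d * pnorm(c, mean = mu, sd = sigma, lower.tail = FALSE, log.p = TRUE))
}

est <- optim(c(0, 0, 2), normal_rightcens_ll, y = y, X = X, c = 5, d = d,
             method = "BFGS", control = list(fnscale = -1), hessian = TRUE)

# variance matrix: inverse of the negative Hessian (observed information)
V <- solve(-est$hessian)
se <- sqrt(diag(V))
round(cbind(estimate = est$par, std.error = se), 3)
```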

The figure below shows the estimated intercept, slope, and error SD from the two approaches, compared to the truth. This figure shows that OLS is underestimating the slope, as you might have guessed from the scatterplot above.

18.4 More Reading

The normal model with censoring that I used above to illustrate the idea is sometimes called the “tobit model.” King (1998, 208–10) provides a brief discussion of censoring. Long (1997, 7:187–215) provides a chapter-length description of censoring in the context of the tobit model. Box-Steffensmeier and Jones (2004, 15–19) discuss censoring in the context of duration models, where it is seemingly omnipresent.

King, Gary. 1998. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Revised edition. Ann Arbor: University of Michigan Press.
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Vol. 7. Advanced Quantitative Techniques in the Social Sciences. Thousand Oaks, CA: Sage.
Box-Steffensmeier, Janet M., and Bradford S. Jones. 2004. Event History Modeling: A Guide for Social Scientists. Cambridge: Cambridge University Press.