Unordered Logit

The Inverse-Logit Function

Suppose \(y_i \in \{0, 1\}\). Using logistic regression, we model \(\Pr(y_i = 1)\) as a function of covariates, so that

\[ \Pr(Y_i = 1) = \frac{\exp(X_i \beta)}{1 + \exp(X_i \beta)}. \]

The function \(f(z) = \frac{e^z}{1 + e^z}\) is called the inverse-logit function; it maps any real number, such as the linear predictor \(X_i \beta\), to a probability strictly between 0 and 1.

Because the two probabilities must sum to one:

\[ \Pr(Y_i = 0) = 1 - \Pr(Y_i = 1) = \frac{1}{1 + \exp(X_i \beta)}. \]
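As a quick sketch in R: the inverse-logit can be written directly from the definition, and it matches base R's logistic CDF `plogis()`.

```r
# inverse-logit, written directly from the definition
inv_logit <- function(z) exp(z) / (1 + exp(z))

inv_logit(0)   # 0.5
inv_logit(2)   # ~0.881

# base R's logistic CDF gives the same mapping
plogis(2)      # ~0.881
```

For very large \(z\), the form `1 / (1 + exp(-z))` (or `plogis()` itself) is numerically safer than `exp(z) / (1 + exp(z))`, which can overflow.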

The Softmax Function

The softmax generalizes the inverse-logit: it takes \(J\) real inputs (i.e., a vector in \(\mathbb{R}^J\)) and rescales them to sum to one.

\[ \text{softmax}(z_j) = \frac{\exp(z_j)}{\sum_{k=1}^{J} \exp(z_k)}, \qquad j = 1,\ldots,J. \] Because the \(J\) outputs must sum to one, it is natural to interpret the outputs as probabilities.
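A minimal softmax in R (the max-subtraction step is an added numerical-stability trick; it does not change the output):

```r
softmax <- function(z) {
  z <- z - max(z)          # stabilizes exp() without changing the result
  exp(z) / sum(exp(z))
}

p <- softmax(c(-1, 1, 2, 0.5))
round(p, 4)   # 0.0303 0.2242 0.6095 0.1360
sum(p)        # 1
```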

The OJS widget below allows you to experiment with \(J = 4\) inputs and see how the resulting probabilities change.

Adding covariates

Now suppose the outcome takes \(J\) possible values:

\[ y_i \in \{1, 2, \ldots, J\}. \]

Example 1: We might label vote choice in the US as (1) abstain, (2) Republican, (3) Democrat, or (4) other.

Example 2: We might label coup attempts in a given country-year as (1) none attempted, (2) failed attempt, or (3) successful attempt.

For each category \(j\), define a linear predictor \(\eta_{ij} = X_i \beta_j\). Here, each \(\beta_j\) is a vector of coefficients.

We can use the softmax function to convert these linear predictors into probabilities that sum to one.

\[ \Pr(Y_i = j) = \frac{\exp(\eta_{ij})}{\sum_{k=1}^{J} \exp(\eta_{ik})}, \qquad j = 1, \ldots, J. \]

This is a generalization of the inverse-logit to \(J\) categories. If \(J = 2\), the softmax reduces to the inverse-logit.
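To see the \(J = 2\) case concretely, here is a quick check (assuming the two linear predictors are \(z\) and \(0\)):

```r
z <- 1.3

# softmax over the two linear predictors (z, 0)
p_softmax <- exp(z) / (exp(z) + exp(0))

# inverse-logit of z
p_invlogit <- exp(z) / (1 + exp(z))

all.equal(p_softmax, p_invlogit)  # TRUE
```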

However, this model is not identified because adding any constant to each of the \(\eta_{ij}\) produces the same probabilities.

Example Inputs

| Input \(z_j\) | Softmax \(p_j\) |
|--------------:|----------------:|
| \(-1\)        | 0.0303          |
| \(1\)         | 0.2242          |
| \(2\)         | 0.6095          |
| \(0.5\)       | 0.1360          |

Adding a Constant (\(+2\))

| New Input \(z_j + 2\) | Softmax \(p_j\) |
|----------------------:|----------------:|
| \(-1 + 2 = 1\)        | 0.0303          |
| \(1 + 2 = 3\)         | 0.2242          |
| \(2 + 2 = 4\)         | 0.6095          |
| \(0.5 + 2 = 2.5\)     | 0.1360          |

Adding a constant to all inputs leaves the softmax unchanged. This is why the multinomial logit model requires an identification constraint.
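We can verify this invariance numerically in R:

```r
softmax <- function(z) exp(z) / sum(exp(z))

z <- c(-1, 1, 2, 0.5)

# shifting every input by the same constant leaves the probabilities unchanged
all.equal(softmax(z), softmax(z + 2))  # TRUE
```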

To identify the model, we set \(\eta_{iJ} = 0\). Then for \(j = 1, \ldots, J-1\), we have

\[ \Pr(Y_i = j) = \frac{\exp(X_i \beta_j)} {1 + \sum_{k=1}^{J-1} \exp(X_i \beta_k)}. \]

And for the “baseline” category \(J\), we have

\[ \Pr(Y_i = J) = \frac{1} {1 + \sum_{k=1}^{J-1} \exp(X_i \beta_k)}. \]
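A small sketch of this parameterization in R, using made-up illustrative coefficients (not estimates from any data):

```r
# baseline-category parameterization with J = 3; category 3 is the baseline
x      <- c(1, 0.5)          # intercept and one covariate
beta_1 <- c(0.2, -0.3)       # hypothetical coefficients for category 1
beta_2 <- c(-0.1, 0.4)       # hypothetical coefficients for category 2

eta <- c(sum(x * beta_1), sum(x * beta_2), 0)  # baseline gets eta = 0
p   <- exp(eta) / sum(exp(eta))
sum(p)  # 1
```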

# load packages
library(tidyverse)

# load data
anes <- read_csv("https://pos5747.github.io/data/red-state-pid.csv") |>
  mutate(pid = factor(pid, ordered = TRUE)) |>
  mutate(pid3 = case_when(
    pid %in% c("1. Strong Democrat",
               "2. Not very strong Democrat") ~ "D",
    pid %in% c("6. Not very strong Republican",
               "7. Strong Republican") ~ "R",
    pid %in% c("3. Independent (leans Democrat)",
               "4. Independent",
               "5. Independent (leans Republican)") ~ "I"
  )) |>
  glimpse()
Rows: 4,596
Columns: 6
$ state_name             <chr> "Oklahoma", "Idaho", "Virginia", "California", …
$ state_median_income    <dbl> 66.1, 81.2, 92.1, 100.1, 97.1, 79.7, 77.5, 100.…
$ rs_state_median_income <dbl> -0.595342224, -0.010181986, 0.412218451, 0.7222…
$ income                 <dbl> 1.5, 1.5, 1.5, -0.5, -0.5, -0.5, -0.5, -2.5, 0.…
$ pid                    <ord> 7. Strong Republican, 4. Independent, 1. Strong…
$ pid3                   <chr> "R", "I", "D", "R", "I", "I", "I", "I", "D", "D…
library(nnet)
fit <- multinom(pid3 ~ income, data = anes)
# weights:  9 (4 variable)
initial  value 5049.222079 
final  value 5029.924776 
converged
# summarize fit
summary(fit)
Call:
multinom(formula = pid3 ~ income, data = anes)

Coefficients:
  (Intercept)       income
I  -0.1676635 -0.058812795
R  -0.1575483 -0.005553168

Std. Errors:
  (Intercept)     income
I  0.03689560 0.02554012
R  0.03696938 0.02558134

Residual Deviance: 10059.85 
AIC: 10067.85 
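`multinom()` uses the first factor level, here "D", as the baseline category, so the printed coefficients are \(\beta_I\) and \(\beta_R\). We can recover a fitted probability by hand, e.g. at income = 0 (a sketch using the rounded coefficients from the summary above):

```r
# coefficients copied (rounded) from the multinom() summary; "D" is the baseline
b_I <- c(-0.1677, -0.0588)   # (Intercept), income for category I
b_R <- c(-0.1575, -0.0056)   # (Intercept), income for category R
x   <- c(1, 0)               # intercept and income = 0

eta <- c(D = 0, I = sum(x * b_I), R = sum(x * b_R))  # baseline D gets eta = 0
p   <- exp(eta) / sum(exp(eta))
round(p, 3)   # D ~0.370, I ~0.313, R ~0.316
```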
# use {marginaleffects}
library(marginaleffects)
p <- predictions(fit, 
            newdata = datagrid(income = unique)) |>
  glimpse()
Rows: 18
Columns: 12
$ rowid     <int> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6
$ group     <chr> "D", "D", "D", "D", "D", "D", "I", "I", "I", "I", "I", "I", …
$ estimate  <dbl> 0.3514003, 0.3590651, 0.3666381, 0.3741114, 0.3814781, 0.388…
$ std.error <dbl> 0.015903413, 0.011740574, 0.008352337, 0.007175789, 0.009256…
$ statistic <dbl> 22.09590, 30.58327, 43.89646, 52.13523, 41.21338, 29.40193, …
$ p.value   <dbl> 3.460671e-108, 2.043310e-205, 0.000000e+00, 0.000000e+00, 0.…
$ s.value   <dbl> 356.9772, 679.9644, Inf, Inf, Inf, 628.7910, 334.8351, 601.5…
$ conf.low  <dbl> 0.3202302, 0.3360540, 0.3502678, 0.3600472, 0.3633363, 0.362…
$ conf.high <dbl> 0.3825704, 0.3820762, 0.3830083, 0.3881757, 0.3996199, 0.414…
$ pid3      <chr> "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", …
$ income    <dbl> -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, -2.5, -1.5, -0.5, 0.5, 1.5,…
$ df        <dbl> Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, …
ggplot(p, aes(x = income, y = estimate, color = group)) + 
  geom_point() + 
  geom_line()
# load packages
library(brms)

# fit a similar model, but let the coefficients vary across states
fit <- brm(pid3 ~ income*rs_state_median_income + (1 + income | state_name), 
           data = anes, 
           family = categorical(),
           cores = 4, 
           chains = 4, 
           warmup = 2000, 
           iter = 4000,
           backend = "rstan")
# make grid 
grid <- anes |>
  dplyr::select(state_name, state_median_income, rs_state_median_income) |>
  distinct() |>
  mutate(income = NA)

# make predictions
p <- predictions(fit, 
                 variables = list(income = unique),
                 newdata = grid) |>
  mutate(state_name = reorder(state_name, rs_state_median_income))

# make plot
gg <- ggplot(p, aes(x = income, y = estimate, color = group)) +
  geom_line() +
  facet_wrap(vars(state_name))
print(gg)

An observation worth remembering

glimpse(anes)
Rows: 4,596
Columns: 6
$ state_name             <chr> "Oklahoma", "Idaho", "Virginia", "California", …
$ state_median_income    <dbl> 66.1, 81.2, 92.1, 100.1, 97.1, 79.7, 77.5, 100.…
$ rs_state_median_income <dbl> -0.595342224, -0.010181986, 0.412218451, 0.7222…
$ income                 <dbl> 1.5, 1.5, 1.5, -0.5, -0.5, -0.5, -0.5, -2.5, 0.…
$ pid                    <ord> 7. Strong Republican, 4. Independent, 1. Strong…
$ pid3                   <chr> "R", "I", "D", "R", "I", "I", "I", "I", "D", "D…
f <- pid ~ income
fit1 <- MASS::polr(f, data = anes)
fit2 <- multinom(f, data = anes)
# weights:  21 (12 variable)
initial  value 8943.403045 
iter  10 value 8445.069398
final  value 8432.049880 
converged
BIC(fit1, fit2)
     df      BIC
fit1  7 17011.54
fit2 12 16965.30

Even though pid is ordered and the multinomial logit spends five more parameters, it achieves a lower BIC than the ordered logit here: the ordered model's parsimony does not make up for its restrictiveness in this case.