
Consider a situation where there can be membership in group $A$, group $B$, both groups, or neither group. If we wanted to predict group membership probabilities from some covariate information, this would be a multilabel problem.

However, an alternative modeling strategy (I fear a rather poor one) might be to say that there are three categories—$A$, $B$, and nothing—and proceed with the predictions assuming three categories. Plenty of machine learning work has been done for similar problems with categorical outcomes (e.g., MNIST digit classification), and among the most basic ways to solve such a problem is multinomial logistic regression. (Regardless of the particular modeling strategy, e.g., a neural network approach, the conditional distribution seems to be a multinomial on one roll of the die.)
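For concreteness, multinomial logistic regression with $K$ categories models the conditional class probabilities with a softmax of linear predictors (writing $\beta_k$ for the coefficient vector of category $k$, with one category's coefficients typically fixed at zero for identifiability):

$$
P(Y = k \mid x) = \frac{\exp\left(x^\top \beta_k\right)}{\sum_{j=1}^{K} \exp\left(x^\top \beta_j\right)},
\qquad
\sum_{k=1}^{K} P(Y = k \mid x) = 1.
$$

The "sum to one" constraint on the right is exactly the property at issue below.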

If the $A$ and $B$ categories are mutually exclusive, I see no problem. There are just three categories, much as there are ten categories in the MNIST digits. However, $A$ and $B$ are not mutually exclusive. There are times when both occur, in addition to only one of the two occurring or neither occurring.

That concerns me immensely when it comes to fitting a multinomial logistic regression. If we code the categories with the standard $0$ and $1$ indicators, we wind up with something like:

$$
\begin{aligned}
A &: (0,1,0)\\
B &: (0,0,1)\\
\text{Neither} &: (1,0,0)\\
\text{Both} &: (0,1,1)
\end{aligned}
$$

By having a vector with two $1$s in it, we seem to be telling the model that these are not conditional multinomial distributions on just one trial (one roll of the die), as we need at least two trials to get a vector like $(0,1,1)$, rather than the one trial in a multinomial logistic regression for a multi-class problem like MNIST digit classification. If we have two trials, then vectors like $(1,0,1)$ and $(2,0,0)$ should be possible, and we know them not to be.
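To spell the constraint out, the trinomial probability mass function with $n$ trials is

$$
P(Y_1 = y_1, Y_2 = y_2, Y_3 = y_3) = \frac{n!}{y_1!\,y_2!\,y_3!}\, p_1^{y_1} p_2^{y_2} p_3^{y_3},
\qquad y_1 + y_2 + y_3 = n.
$$

With $n = 1$, the support is only the one-hot vectors $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$. The vector $(0,1,1)$ has $y_1 + y_2 + y_3 = 2$ and so requires $n \ge 2$, under which $(1,0,1)$ and $(2,0,0)$ would also have positive probability.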

Thus, it seems like a mistake to shoehorn a multilabel problem into this kind of multinomial logistic regression.

However, does multinomial logistic regression exhibit robustness to this kind of apparent mistake? (Is it even a mistake at all?)

This comes from a paper (in a good journal for that field) I read where executives could have either of two kinds of severance packages, both kinds, or neither. The regression modeling treated the problem as having three categories, seemingly ignoring the possibility of executives having both kinds of severance packages (which one of the authors confirmed to me is rather common). “But this is a problem,” I thought as I read it, especially after the author confirmed to me that the two severance packages are not mutually exclusive the way that “7” and “4” are in the MNIST digits.

EDIT

In a simulation, such a multinomial logistic regression approach gives left skewness to the distribution of p-values when the null hypothesis is true. This does not paint a rosy picture for such an approach, even if it is not absolutely emphatic evidence that such an approach is highly problematic.

```r
library(ggplot2)
library(lmtest)
library(nnet)

set.seed(2023)
N <- 10000
B <- 1000
x <- runif(N, 0, 1)
ps_mlr <- rep(NA, B)

for (i in 1:B){

  # Simulate two binary variables (each is a type of severance package,
  # not necessarily mutually exclusive)
  y2 <- rbinom(N, 1, 0.5)
  y3 <- rbinom(N, 1, 0.5)

  # Create multinomial-type data from the multi-label outcome
  y <- cbind(rep(1, N), rep(0, N), rep(0, N))
  for (j in 1:N){
    if (y2[j] == 1){
      y[j, c(1, 2)] <- c(0, 1)
    }
    if (y3[j] == 1){
      y[j, c(1, 3)] <- c(0, 1)
    }
  }

  # Fit full multinomial logistic regression
  L1 <- nnet::multinom(y ~ x)

  # Fit null model
  L0 <- nnet::multinom(y ~ 1)

  # Likelihood ratio test of the full model against the null model;
  # store the p-value
  ps_mlr[i] <- lmtest::lrtest(L0, L1)$`Pr(>Chisq)`[2]
}

d_1 <- data.frame(
  p = ps_mlr,
  CDF = ecdf(ps_mlr)(ps_mlr),
  Group = "Null Hypothesis is True"
)
d_2 <- data.frame(
  p = seq(0, 1, 0.001),
  CDF = ecdf(seq(0, 1, 0.001))(seq(0, 1, 0.001)),
  Group = "Theoretical Distribution of p-values"
)
d <- rbind(d_1, d_2)
ggplot(d, aes(x = p, y = CDF, col = Group)) +
  geom_line()
```
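For comparison, one of the alternatives raised in the comments—treating the four possibilities as four mutually exclusive categories—can be sketched as follows. This is only a sketch under the same simulation setup as above (same `N`, `x`, and label-generating mechanism); each observation then falls in exactly one category, so the one-trial multinomial assumption holds.

```r
library(nnet)
library(lmtest)

set.seed(2023)
N  <- 10000
x  <- runif(N, 0, 1)
y2 <- rbinom(N, 1, 0.5)  # severance package A indicator
y3 <- rbinom(N, 1, 0.5)  # severance package B indicator

# Recode the two non-exclusive labels into four mutually exclusive
# categories: "neither", "A only", "B only", "both"
y4 <- factor(ifelse(y2 == 1 & y3 == 1, "both",
             ifelse(y2 == 1, "A only",
             ifelse(y3 == 1, "B only", "neither"))))

# Four-category multinomial logistic regression and its null model
M1 <- nnet::multinom(y4 ~ x, trace = FALSE)
M0 <- nnet::multinom(y4 ~ 1, trace = FALSE)
lmtest::lrtest(M0, M1)
```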

[Figure: empirical CDF of the simulated p-values under the null hypothesis vs. the theoretical uniform CDF]

Dave
  • Why not code it as A (1,0,0,0), B(0,1,0,0), both (0,0,1,0) and neither (0,0,0,1)? ... or two separate logistic regressions? – Dikran Marsupial Aug 18 '23 at 16:24
  • 1
    @DikranMarsupial I see alternative and seemingly preferable approaches. I ask about this approach in particular because a friend published a paper doing this, and I am wondering if my, “Oh, friend, you really shouldn’t have done it that way,” was warranted, or if a multi-label approach is technically correct but not so much better. – Dave Aug 19 '23 at 00:39
  • I think you were 100% right! – Dikran Marsupial Aug 19 '23 at 06:38
  • It would be interesting to know why the friend adopted this approach. I can sort of see why you might want to do that if A and B were not independent, but in that case the four-outcome, four-output coding is probably right. I looked at a problem involving spatial rainfall patterns where a more compact solution was needed than enumerating all 2^n possible rain/no-rain combinations across locations. But it could just be that there was software for MLR available already, which is not such a good reason ;o) – Dikran Marsupial Aug 19 '23 at 13:20
  • 1
    @DikranMarsupial What do you mean about A and B being “independent”? It is known that CEOs who get one type of severance typically get the other. In that sense, I would consider my A and B to be quite dependent, but I don’t see the connection to your rain/no rain problem. (The rain/no rain problem almost seems like image segmentation where you model if each part of a grid does or does not contain something.) – Dave Aug 20 '23 at 13:56
  • it was station data rather than gridded data, but the rain/no-rain for individual stations is dependent because the rainfall (e.g. related to weather fronts) is spatially correlated. So if you want to generate realistic rainfall patterns, you can't just use the probability of rainfall independently for each station ignoring the others. This is important if you are looking at something like flood prediction, where it is the pattern across the catchment that matters. If A happening affects the probability of B happening (or vice-versa) there may be a benefit in modelling dependency. – Dikran Marsupial Aug 20 '23 at 14:05
  • I can sort of see why your friend took this approach now, but the four-outcome MLR model would be the right approach to modelling the dependency between outcomes AFAICS. – Dikran Marsupial Aug 20 '23 at 14:10
  • Some sort of copula model might be appropriate for this problem (dependency between classes)? – Dikran Marsupial Aug 21 '23 at 12:38
  • 1
    @DikranMarsupial That’s why I think something like bivariate probit might be a good place to start. I don’t have experience using such a model (touched on it a bit in school, nothing really since then), but it seems like that would allow for the dependency between classes (high correlation in the bivariate normal would correspond to the two severance packages tending to happen together or neither happening, I think), and then more complex copulae could model more bizarre types of dependence (though I have struggles to my wrap my brain around how). – Dave Aug 21 '23 at 13:00

1 Answer


My intuition is that this is not going to be a good idea; however, at the moment I can only offer a possible example where it will give you an obviously wrong answer. Say the targets are independent of the input attributes, and there are equal numbers of "A", "B", "neither", and "both". Then the conditional mean everywhere (which is what the model predicts) will be the constant (1/4, 1/2, 1/2), a value the output can't represent because of the softmax activation function (using horrible neural network terminology ;o), as it sums to 5/4, which is greater than one.

I think in this situation we would like the first output to be half the value of the other two (as "neither" happens 1/4 of the time, but the "A" output should light up half the time, either because the example is an "A" alone or because it is a "both", and likewise for the "B" output). I think the closest you could get while summing to one would be (1/5, 2/5, 2/5). That gives the probability of "neither" as 1/5 when it should be 1/4.
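A sketch of where the (1/5, 2/5, 2/5) figure comes from: with targets independent of $x$, an intercept-only softmax model fit by maximum likelihood converges to the distribution $p$ on the simplex that minimizes the cross-entropy against the target frequencies $q = (1/4, 1/2, 1/2)$. A Lagrange-multiplier argument gives $p$ as the normalization of $q$:

$$
\min_{p_k \ge 0,\ \sum_k p_k = 1} \; -\sum_k q_k \log p_k
\quad\Longrightarrow\quad
p_k = \frac{q_k}{\sum_j q_j},
$$

so with $\sum_j q_j = 5/4$ we get $p = \tfrac{4}{5}\left(\tfrac14, \tfrac12, \tfrac12\right) = \left(\tfrac15, \tfrac25, \tfrac25\right)$.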

So I'd say it isn't robust because the "sum to one" constraint is not appropriate for this method of encoding the classes.

Caveat lector: I've only tried this on my addled brain, not a computer, so I could be writing nonsense again.

Dikran Marsupial
  • 1
    Questions remain, but this is an interesting take on the matter, and our discussion in the comments has been interesting, too. +50 – Dave Aug 25 '23 at 17:15