
In a setting with a binary $y$ like dog/cat, a reasonable statistical model is to posit that the probability parameter $p$ of a $\text{Binomial}(1, p)$ distribution is some function $f$ of features $X$. This leads to many common machine learning approaches like logistic regression and neural networks.
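To make the binary likelihood concrete, here is a minimal sketch of the $\text{Binomial}(1, p)$ log-likelihood with $p = f(X)$ taken to be a sigmoid of a linear score (the logistic regression case); the data and coefficients are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_log_likelihood(beta, X, y):
    """Log-likelihood of y in {0, 1} under p_i = sigmoid(X_i @ beta)."""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data: 100 observations, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, sigmoid(X @ beta_true))

print(bernoulli_log_likelihood(beta_true, X, y))
```

Maximizing this sum over $\beta$ is exactly maximum-likelihood logistic regression; a neural network just replaces the linear score with a more flexible $f$.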

In a setting with multiple classes in $y$, such as dog/cat/horse, a reasonable statistical model is that the probability vector $\vec p$ of a $\text{Multinomial}(1, \vec p)$ distribution is some function of features $X$. Much like in the binary setting, this leads to many common approaches like multinomial logistic regression and various forms of deep learning (e.g., convolutional neural networks for MNIST handwritten digits).
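The multiclass likelihood can be sketched the same way, with $\vec p$ given by a softmax of linear scores (multinomial logistic regression); again, the data, coefficients, and the four-class setup are hypothetical:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def multinomial_log_likelihood(B, X, y):
    """Log-likelihood of labels y in {0, ..., K-1} under P = softmax(X @ B)."""
    P = softmax(X @ B)
    return np.sum(np.log(P[np.arange(len(y)), y]))

# Hypothetical data: 100 observations, 3 features, 4 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
B_true = rng.normal(size=(3, 4))
y = np.array([rng.choice(4, p=p) for p in softmax(X @ B_true)])

print(multinomial_log_likelihood(B_true, X, y))
```

Each observation contributes the log-probability of its single observed class, which is the $\text{Multinomial}(1, \vec p)$ likelihood exactly.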

In a multi-label setting, such as identifying if a photograph contains a dog, a cat, or both, what would be the statistical model?

(To a large extent, I think I mean the formal likelihood, regardless of the functional form that links the probability of class membership to the features, but I want to leave it a bit vague to allow for an answer that I'm thinking about this wrong by framing it in terms of a statistical likelihood.)

EDIT

The comments have clarified that an Ising likelihood works. With that being the case, how do the prior probabilities of the classes come into the picture? For instance, in a logistic regression, if I have $99$ $0$s for every $1$, I expect low probabilities of $1$ unless the features are extremely informative. In a multi-label setting, it seems like the prior probability of each class would be the ratio of that class to all possible alternatives, which I would consider to be zero: out of every possible sight there is to see, the probability of seeing a dog ought to be tiny or even zero (and the fact that we’re on Earth (for now) and near dogs is what allows us to see dogs with frequency).
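The $99$-to-$1$ intuition above can be checked directly: with no informative features, the maximum-likelihood logistic regression is just an intercept, and that intercept encodes the base rate. A minimal worked example (the $99{:}1$ split is the one from the text):

```python
import numpy as np

# 99 zeros for every 1, and no features at all.
y = np.array([0] * 99 + [1])

# The intercept-only MLE sets p = mean(y), so the fitted intercept
# is the log-odds of the base rate.
base_rate = y.mean()                              # 0.01
intercept = np.log(base_rate / (1 - base_rate))   # logit(0.01) ~= -4.595

print(base_rate, intercept)
```

So in the single-label setting the prior probability enters the likelihood through the intercept; the question is what plays that role when the label sets live in $\{0,1\}^K$.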

It seems like something like this fails for multi-label classification.

Dave
  • is this a job for an Ising distribution? – John Madden Aug 21 '22 at 04:13
  • @JohnMadden I remember Ising models from a stochastic processes class, but I don’t remember much, and my quick read of the Wikipedia article does not clarify what you might mean. Perhaps you could clarify or even write an answer to elaborate. – Dave Aug 21 '22 at 04:20
  • For the dog cat binaries, you'd have a multivariate distribution over $\{0,1\}^2$ (neither, dog only, cat only, both). – Glen_b Aug 21 '22 at 04:30
  • @Glen_b I could get on board with that! For a multi-label problem with dog/cat/horse as the possible labels, would it then be $\{0,1\}^3$ with three probabilities returned (all of which could be small for an input image of, say, a crocodile)? // And do you mean this as a way to tie the problem to the Ising model? I could see that making sense for the lattice in the Ising model. – Dave Aug 21 '22 at 04:49
  • 1
  • ${0,1}^3$ would have 8 probabilities, not 3; with your dog/cat example, "both" was a possibility. If a picture could have both a dog and a cat, presumably there's 4 possibilities (no dog and no cat, dog but no cat, cat but no dog, cat and dog). With 3 animals, that's 8 possibilities. 2. I was making no comment in relation to the Ising model; it's probably 30 years since I looked at that and I'd have to go look it up to I remember what it was.
  • – Glen_b Aug 21 '22 at 06:10
  • The Ising distribution is just a multivariate Bernoulli with arbitrary dependence, specified with a matrix $\Omega$ (and "main effects" specified by vector $\rho$). So in this case it would allow for arbitrary probabilities assigned to the four cases {None, {Dog},{Cat},{Dog,Cat}}, and does so in a way that's nice to further parameterize if you want. It's hard to estimate an arbitrary dense $\Omega$, so common practice is to impose some kind of structure, such as low rank https://www.nature.com/articles/srep09050 or spatial/lattice neighborhood structure. – John Madden Aug 21 '22 at 16:01
  • @JohnMadden I’m totally on board with the Ising model being the likelihood. However, the issue of a prior probability (see edit) still confounds me, and I cannot consider the statistical model to be complete unless I understand that. – Dave Sep 08 '22 at 15:41
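The Ising / multivariate-Bernoulli parameterization described in the comments can be sketched by enumerating the $2^K$ label sets. The particular values of $\rho$ and $\Omega$ below, and the $\tfrac12$ convention on the quadratic term, are assumptions for illustration only:

```python
import itertools
import numpy as np

# Ising-style model over label sets s in {0,1}^K:
#   P(s) proportional to exp(rho . s + 0.5 * s' Omega s)
# Illustrative parameters for K = 2 labels (dog, cat):
rho = np.array([0.2, -0.5])        # "main effects" per label
Omega = np.array([[0.0, 1.0],
                  [1.0, 0.0]])     # symmetric interaction: dog and cat co-occur

# Enumerate all 2^K = 4 label sets: {}, {cat}, {dog}, {dog, cat}.
states = np.array(list(itertools.product([0, 1], repeat=2)))
scores = states @ rho + 0.5 * np.einsum('si,ij,sj->s', states, Omega, states)
probs = np.exp(scores)
probs /= probs.sum()               # normalize over the 4 label sets

for s, p in zip(states, probs):
    print(s, round(float(p), 3))
```

This makes Glen_b's point concrete: the likelihood assigns a probability to every subset of labels, not one probability per label, and the $\Omega$ entries let co-occurring labels (dog and cat) be more or less likely than independence would imply.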