
In a setting with a binary $y$ like dog/cat, a reasonable statistical model is to posit that the probability parameter $p$ of a $\text{Binomial}(1, p)$ distribution is some function $f$ of features $X$. This leads to many common machine learning approaches like logistic regression and neural networks.
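To make the binary likelihood concrete, here is a minimal sketch of the $\text{Binomial}(1, p)$ log-likelihood with $p = f(X)$ taken to be a sigmoid of a linear score (the logistic regression case); the data and coefficients are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_log_likelihood(beta, X, y):
    """Log-likelihood of y in {0, 1} under p_i = sigmoid(X_i @ beta)."""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data: 100 observations, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, sigmoid(X @ beta_true))

print(bernoulli_log_likelihood(beta_true, X, y))
```

Maximizing this sum over $\beta$ is exactly maximum-likelihood logistic regression; a neural network just replaces the linear score with a more flexible $f$.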

In a setting with multiple classes in $y$, such as dog/cat/horse, a reasonable statistical model is that the probability vector $\vec p$ of a $\text{Multinomial}(1, \vec p)$ distribution is some function of features $X$. Much like in the binary setting, this leads to many common approaches like multinomial logistic regression and various forms of deep learning (e.g., convolutional neural networks for MNIST handwritten digits).
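The multiclass likelihood can be sketched the same way, with $\vec p$ given by a softmax of linear scores (multinomial logistic regression); again, the data, coefficients, and the four-class setup are hypothetical:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def multinomial_log_likelihood(B, X, y):
    """Log-likelihood of labels y in {0, ..., K-1} under P = softmax(X @ B)."""
    P = softmax(X @ B)
    return np.sum(np.log(P[np.arange(len(y)), y]))

# Hypothetical data: 100 observations, 3 features, 4 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
B_true = rng.normal(size=(3, 4))
y = np.array([rng.choice(4, p=p) for p in softmax(X @ B_true)])

print(multinomial_log_likelihood(B_true, X, y))
```

Each observation contributes the log-probability of its single observed class, which is the $\text{Multinomial}(1, \vec p)$ likelihood exactly.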

In a multi-label setting, such as identifying if a photograph contains a dog, a cat, or both, what would be the statistical model?

(To a large extent, I think I mean the formal likelihood, regardless of the functional form that links the probability of class membership to the features, but I want to leave it a bit vague to allow for an answer that I'm thinking about this wrong by framing it in terms of a statistical likelihood.)

EDIT

The comments have clarified that an Ising likelihood works. With that being the case, how do the prior probabilities of the classes come into the picture? For instance, in a logistic regression, if I have $99$ $0$s for every $1$, I expect low probabilities of $1$ unless the features are extremely informative. In a multi-label setting, it seems like the prior probability of each class would be the ratio of that class to all possible alternatives, which I would consider to be zero: out of every possible sight there is to see, the probability of seeing a dog ought to be tiny or even zero (and the fact that we’re on Earth (for now) and near dogs is what allows us to see dogs with frequency).
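The $99$-to-$1$ intuition above can be checked directly: with no informative features, the maximum-likelihood logistic regression is just an intercept, and that intercept encodes the base rate. A minimal worked example (the $99{:}1$ split is the one from the text):

```python
import numpy as np

# 99 zeros for every 1, and no features at all.
y = np.array([0] * 99 + [1])

# The intercept-only MLE sets p = mean(y), so the fitted intercept
# is the log-odds of the base rate.
base_rate = y.mean()                              # 0.01
intercept = np.log(base_rate / (1 - base_rate))   # logit(0.01) ~= -4.595

print(base_rate, intercept)
```

So in the single-label setting the prior probability enters the likelihood through the intercept; the question is what plays that role when the label sets live in $\{0,1\}^K$.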

It seems like something like this fails for multi-label classification.

Dave
  • is this a job for an Ising distribution? – John Madden Aug 21 '22 at 04:13
  • @JohnMadden I remember Ising models from a stochastic processes class, but I don’t remember much, and my quick read of the Wikipedia article does not clarify what you might mean. Perhaps you could clarify or even write an answer to elaborate. – Dave Aug 21 '22 at 04:20
  • For the dog cat binaries, you'd have a multivariate distribution over $\{0,1\}^2$ (neither, dog only, cat only, both). – Glen_b Aug 21 '22 at 04:30
  • @Glen_b I could get on board with that! For a multi-label problem with dog/cat/horse as the possible labels, would it then be $\{0,1\}^3$ with three probabilities returned (all of which could be small for an input image of, say, a crocodile)? // And do you mean this as a way to tie the problem to the Ising model? I could see that making sense for the lattice in the Ising model. – Dave Aug 21 '22 at 04:49
  • 1
  • ${0,1}^3$ would have 8 probabilities, not 3; with your dog/cat example, "both" was a possibility. If a picture could have both a dog and a cat, presumably there's 4 possibilities (no dog and no cat, dog but no cat, cat but no dog, cat and dog). With 3 animals, that's 8 possibilities. 2. I was making no comment in relation to the Ising model; it's probably 30 years since I looked at that and I'd have to go look it up to I remember what it was.
  • – Glen_b Aug 21 '22 at 06:10
  • The Ising distribution is just a multivariate Bernoulli with arbitrary dependence, specified with a matrix $\Omega$ (and "main effects" specified by vector $\rho$). So in this case it would allow for arbitrary probabilities assigned to the four cases {None, {Dog},{Cat},{Dog,Cat}}, and does so in a way that's nice to further parameterize if you want. It's hard to estimate an arbitrary dense $\Omega$, so common practice is to impose some kind of structure, such as low rank https://www.nature.com/articles/srep09050 or spatial/lattice neighborhood structure. – John Madden Aug 21 '22 at 16:01
  • @JohnMadden I’m totally on board with the Ising model being the likelihood. However, the issue of a prior probability (see edit) still confounds me, and I cannot consider the statistical model to be complete unless I understand that. – Dave Sep 08 '22 at 15:41
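The Ising / multivariate-Bernoulli parameterization described in the comments can be sketched by enumerating the $2^K$ label sets. The particular values of $\rho$ and $\Omega$ below, and the $\tfrac12$ convention on the quadratic term, are assumptions for illustration only:

```python
import itertools
import numpy as np

# Ising-style model over label sets s in {0,1}^K:
#   P(s) proportional to exp(rho . s + 0.5 * s' Omega s)
# Illustrative parameters for K = 2 labels (dog, cat):
rho = np.array([0.2, -0.5])        # "main effects" per label
Omega = np.array([[0.0, 1.0],
                  [1.0, 0.0]])     # symmetric interaction: dog and cat co-occur

# Enumerate all 2^K = 4 label sets: {}, {cat}, {dog}, {dog, cat}.
states = np.array(list(itertools.product([0, 1], repeat=2)))
scores = states @ rho + 0.5 * np.einsum('si,ij,sj->s', states, Omega, states)
probs = np.exp(scores)
probs /= probs.sum()               # normalize over the 4 label sets

for s, p in zip(states, probs):
    print(s, round(float(p), 3))
```

This makes Glen_b's point concrete: the likelihood assigns a probability to every subset of labels, not one probability per label, and the $\Omega$ entries let co-occurring labels (dog and cat) be more or less likely than independence would imply.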