Consider a binary classification dataset (X, Y), generated according to some unknown distribution $P(X, Y)$. I have a question about models which output probabilities by minimizing the cross-entropy loss (logistic regression and deep models using a final softmax layer).

  • do these models attempt to predict the true conditional probability $P(Y|X)$?
  • or do they aim for a weaker result, like for example trying to get the order between the classes right?
Richard Hardy
usual me

1 Answer

Minimizing cross-entropy loss is equivalent to maximum likelihood estimation of the regression coefficients and thus to maximum likelihood estimation of the conditional probabilities. (Since the probabilities are a function of the regression coefficients, the invariance property of maximum likelihood means the MLE of the probabilities is that same function evaluated at the coefficient MLE.)
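
To make the equivalence concrete, here is a short sketch for the binary case. The notation is my own assumption, not something from the question: labels $y_i \in \{0, 1\}$, an i.i.d. sample of size $n$, and a model probability $p_\theta(x_i)$ for $P(Y = 1 \vert X = x_i)$. The conditional log-likelihood is

$$\ell(\theta) = \sum_{i=1}^{n} \left[ y_i \log p_\theta(x_i) + (1 - y_i) \log\bigl(1 - p_\theta(x_i)\bigr) \right],$$

and the average cross-entropy loss is

$$\mathrm{CE}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_\theta(x_i) + (1 - y_i) \log\bigl(1 - p_\theta(x_i)\bigr) \right] = -\frac{1}{n}\,\ell(\theta).$$

Since $\mathrm{CE}(\theta)$ is a negative multiple of $\ell(\theta)$, any $\theta$ that minimizes the cross-entropy also maximizes the likelihood, and vice versa.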

Consequently, yes, the models seek out something like $P(Y\vert X)$, but they go beyond that. When you have multiple classes (more than just two), these models are supposed to seek out the true probabilities of all classes, not just the one with the highest probability.
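
One way to see why the target is the probability itself is to fix a single $x$ and look at the expected cross-entropy as a function of the predicted probability. The snippet below is a minimal sketch under an assumed value $P(Y = 1 \vert X = x) = 0.3$ (the name `true_p` and the grid are purely illustrative); it searches a grid of candidate predictions for the one with the lowest expected loss.

```python
import numpy as np

# Assumed (hypothetical) true conditional probability P(Y=1 | X=x) at one fixed x.
true_p = 0.3

# Candidate predicted probabilities q, kept away from 0 and 1 so the logs are finite.
candidates = np.linspace(0.001, 0.999, 999)

# Expected cross-entropy when Y ~ Bernoulli(true_p) and the model predicts q.
expected_ce = -(true_p * np.log(candidates) + (1 - true_p) * np.log(1 - candidates))

# The prediction with the smallest expected loss on the grid.
best_q = candidates[np.argmin(expected_ce)]
print(best_q)
```

The minimizer lands at the grid point closest to `true_p`, which is the sense in which cross-entropy rewards predicting the actual conditional probability rather than only the correct ordering of the classes. Whether a given model can reach that minimizer depends on its flexibility and on the amount of data.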

Neural networks are known for giving poor probability predictions, however, so I will leave a link to a related question of mine about calibration of the probabilities of the non-dominant classes.
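
For readers unfamiliar with calibration, here is a rough sketch of one common check, a binned expected calibration error. This is a generic illustration on simulated data, not the method from the linked question; the function name `expected_calibration_error`, the bin count, and the logit scaling factor of 3 used to mimic overconfidence are all assumptions of mine.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and observed frequency per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0 .. n_bins - 1.
    idx = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap  # weight by the share of points in the bin
    return ece

rng = np.random.default_rng(0)

# Simulated true probabilities and labels drawn from them.
true_p = rng.uniform(0.05, 0.95, size=50_000)
labels = (rng.uniform(size=true_p.size) < true_p).astype(float)

# Mimic an overconfident model by scaling the logits (hypothetical factor of 3).
logits = np.log(true_p / (1 - true_p))
overconfident = 1 / (1 + np.exp(-3 * logits))

ece_calibrated = expected_calibration_error(true_p, labels)
ece_overconfident = expected_calibration_error(overconfident, labels)
print(ece_calibrated, ece_overconfident)
```

Predictions equal to the true probabilities give a small error that only reflects sampling noise, while the overconfident ones give a clearly larger value even though both rank the examples identically.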

Dave
  • Doesn't your MLE argument assume that the data is generated by a logistic regression model of unknown parameter? That's not true in the general case. – usual me Nov 20 '22 at 06:23
  • @usualme if you are asking whether logistic regression can approximate any functional form of $P(y|X)$, then you have answered it yourself: the answer is no. – Tim Nov 20 '22 at 09:02
  • @Tim Thanks, now I understand the logistic regression case. What about deep neural networks, which are universal approximators? Do they predict the true $P(y|X)$? – usual me Nov 20 '22 at 12:30
  • @usualme with an infinitely large network and an infinite amount of data it's guaranteed, but the function still needs to meet some conditions (continuous, bounded). – Tim Nov 20 '22 at 12:59
  • @usualme I strongly recommend asking that as a follow-up question and linking it here – Silverfish Nov 20 '22 at 21:46