
Let's say I have a binary classification problem; two classes $A$, $B$ that I want to predict from a bunch of features.

But I am also given a predicted probability (which should be close to accurate) that it is class $A$ and a probability that it is class $B$, say $0.78$ and $0.22$ respectively. It is these probabilities that I want to target in my problem.

If I just go the standard route of a `BinaryCrossentropy()` loss, then the net knows nothing about these probabilities I want to target.

Does this now turn into more of a regression problem? Or maybe a classification problem, but with a custom loss function?

the man
  • 265
  • Do you mean that you want to predict the class membership probabilities from measurements of observed classes, or do you not have the exact observed classes? The former is what neural networks explicitly do (sort of), while the latter would be using so-called “soft” labels. In many cases, I have my doubts that the soft labels have to be used (but I can see situations where they are necessary, too). – Dave Feb 16 '24 at 13:35

2 Answers


Consider the binary cross-entropy loss $L = -y \log f(x) - (1-y) \log (1-f(x))$ with binary labels $y \in \{0,1\}$, where our model produces predictions $f(x)$ from the features $x$. If we have $n$ observations with the same values $x$ for the features, but not all the same $y$ values, then the average loss for these $n$ samples is given by

$$ \bar{L} = -\frac{k}{n} \log f(x) - \frac{n-k}{n} \log(1 - f(x)), $$ where $k$ is the sum of the $y$ values for these $n$ samples, i.e. the number of observations labeled $1$.

So for the case of having a probabilistic label $P(A) = 0.78$ and $P(\lnot A)=0.22$, the loss is given as $$L = -0.78 \log f(x) - 0.22 \log(1 - f(x)), $$ which matches the average loss for data arising from $78$ observations of $A$ and $22$ observations of $\lnot A$ where all 100 observations have the same feature vector $x$.

This shows that we can generalize cross-entropy loss with binary labels to a loss on probabilistic labels.
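As a quick sanity check, here is a minimal NumPy sketch of this equivalence; the prediction value `p_hat` is an arbitrary illustrative choice, not something given in the question:

```python
import numpy as np

# Arbitrary model output f(x) for a single feature vector x (illustrative value only)
p_hat = 0.6

# Cross-entropy against the probabilistic label P(A) = 0.78
soft_loss = -(0.78 * np.log(p_hat) + 0.22 * np.log(1 - p_hat))

# Average cross-entropy over 78 hard labels y = 1 and 22 hard labels y = 0,
# all sharing the same feature vector and hence the same prediction p_hat
y = np.array([1] * 78 + [0] * 22)
hard_loss = -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)).mean()

print(soft_loss, hard_loss)  # identical up to floating-point error
```

In practice, most binary cross-entropy implementations (including the `BinaryCrossentropy()` mentioned in the question and PyTorch's `BCELoss`) accept float targets in $[0, 1]$, so the probabilistic labels can typically be passed in directly as the targets.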

And we can reach a similar conclusion using maximum likelihood estimators: Machine Learning with Aggregated Frequency Data as Training

There is one caveat here, which is that using probabilistic labels is not identical to modeling data where $n$ trials result in $k$ successes (binomial data). This is because the probabilistic labels suppress the number of trials $n$, whereas the binomial likelihood is weighted by the number of trials $n$. The resulting models will differ whenever there are unequal $n$s in the data. So if we only have two distinct feature vectors $x_1$ and $x_2$ with wildly differing sample sizes, it's easy to see by inspection that the two likelihoods differ.
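To make this concrete, here is a small hypothetical sketch (all counts are made up) where a single shared prediction is fit to two feature vectors with very different sample sizes. The probabilistic-label loss treats both groups equally, so its optimum is the unweighted mean of the labels, while the binomial likelihood is dominated by the larger group:

```python
import numpy as np

# Hypothetical data: two distinct feature vectors with very different sample sizes
#   group 1: 1000 trials, 780 successes -> probabilistic label 0.78
#   group 2:   10 trials,   2 successes -> probabilistic label 0.20
n = np.array([1000, 10])
k = np.array([780, 2])
soft_labels = k / n  # [0.78, 0.20]

# Force the model to output one probability p for both groups (a stand-in for a
# model that cannot separate x1 and x2), and scan a grid of candidate values.
p_grid = np.linspace(0.01, 0.99, 9801)

# Loss 1: average cross-entropy against the probabilistic labels (each group counts once)
soft_ce = -(soft_labels[:, None] * np.log(p_grid)
            + (1 - soft_labels[:, None]) * np.log(1 - p_grid)).mean(axis=0)

# Loss 2: binomial negative log-likelihood (each group weighted by its number of trials)
binom_nll = -(k[:, None] * np.log(p_grid)
              + (n - k)[:, None] * np.log(1 - p_grid)).sum(axis=0)

print(p_grid[soft_ce.argmin()])    # ~0.49, the unweighted mean of the two labels
print(p_grid[binom_nll.argmin()])  # ~0.77, pulled toward the group with 1000 trials
```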

This caveat exposes an important nuance in the question: whether using this weighted cross-entropy model is appropriate depends on how the data were collected and how the probabilistic labels were constructed.

Sycorax
  • 90,934

A very simple way would be to just turn your problem into a "regression" one with squared loss. (I am using scare quotes because the distinction between regression and classification is, in my opinion, spurious from the very beginning.)

Note that the squared loss, in the case of "classification", is the Brier score, which is a proper scoring rule, as is the cross-entropy, which as a scoring rule is known as the log score. This provides enough rationale to use either the squared loss or the log loss per Sycorax's answer. This thread may also be useful for a comparison between the two scores: Why is LogLoss preferred over other proper scoring rules?
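For completeness, a minimal sketch of what either option looks like with probabilistic targets, using plain NumPy; the arrays are illustrative placeholders, not data from the question:

```python
import numpy as np

p_true = np.array([0.78, 0.10, 0.55])  # the given (near-accurate) probabilities of class A
p_hat = np.array([0.70, 0.20, 0.50])   # model outputs

# Squared loss / Brier score: treat the probability itself as a regression target
brier = np.mean((p_hat - p_true) ** 2)

# Log score / cross-entropy against the same probabilistic targets
log_score = -np.mean(p_true * np.log(p_hat) + (1 - p_true) * np.log(1 - p_hat))

print(brier, log_score)  # either can serve directly as the training loss
```

Both losses compare the predicted probability to the given probability directly, so in most frameworks the standard MSE or binary cross-entropy losses can be used as-is, with no custom loss function required.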

Stephan Kolassa
  • 123,354