
I understand that for multi-class classification the correct loss to use is categorical cross-entropy. However, when performing mixup as a regularisation technique, two samples $(X_1, y_1)$ and $(X_2, y_2)$ are combined to create a new sample such that $(X_{new}, y_{new}) = \lambda(X_1, y_1) + (1-\lambda)(X_2, y_2)$, which effectively gives the new sample two labels with different weights.
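
For concreteness, a minimal sketch of the mixing step (NumPy only; the array shapes, class indices, and Beta parameter are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    num_classes = 100

    # Two training samples with one-hot labels for classes 3 and 57 (illustrative).
    x1, x2 = rng.normal(size=(32, 32, 3)), rng.normal(size=(32, 32, 3))
    y1, y2 = np.eye(num_classes)[3], np.eye(num_classes)[57]

    lam = rng.beta(0.2, 0.2)           # lambda drawn from Beta(alpha, alpha)
    x_new = lam * x1 + (1 - lam) * x2  # mixed input
    y_new = lam * y1 + (1 - lam) * y2  # soft label: weight lam on class 3, 1 - lam on class 57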

My question is: should I use categorical cross-entropy, because we are classifying non-mixed samples during evaluation, or binary cross-entropy, because the training has effectively become a multi-label classification problem?

Edit: Just to clarify, this is a multi-class classification problem where all 100 classes are mutually exclusive; however, during training mixup can cause a sample to be labelled with two classes, where class $i$ has label weight $\lambda$ and class $j$ has label weight $1-\lambda$. The two losses I am comparing are specifically keras.losses.BinaryCrossentropy and keras.losses.CategoricalCrossentropy. During evaluation, samples can only be labelled with one class.
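
To make the comparison concrete, here is a rough sketch (assuming TensorFlow/Keras) of how each of the two losses scores the same mixed target; lam, the class indices, and the predicted distribution are chosen purely for illustration:

    import numpy as np
    import tensorflow as tf

    num_classes, lam, i, j = 100, 0.7, 3, 57

    # Mixed soft label: weight lam on class i, 1 - lam on class j.
    y_true = np.zeros((1, num_classes), dtype="float32")
    y_true[0, i], y_true[0, j] = lam, 1.0 - lam

    # Some predicted probability distribution over the 100 classes.
    y_pred = np.full((1, num_classes), 1e-3, dtype="float32")
    y_pred[0, i], y_pred[0, j] = 0.6, 0.3
    y_pred /= y_pred.sum(axis=-1, keepdims=True)

    # CCE sums -y * log(p) over the classes; BCE averages a per-logit
    # binary term -[y * log(p) + (1 - y) * log(1 - p)] over the 100 outputs.
    cce = tf.keras.losses.CategoricalCrossentropy()(y_true, y_pred)
    bce = tf.keras.losses.BinaryCrossentropy()(y_true, y_pred)
    print(float(cce), float(bce))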

Avelina
  • The new sample is a convex combination of the two inputs. If the input labels match, then the label is either 0 or 1. If the labels don't match, then the label is either $\lambda$ or $1-\lambda$. In any of these four cases, the BCE loss works because it achieves a minimum when the model predicts the correct label exactly, regardless of whether the label is 0, 1, or in between. – Sycorax Jun 29 '21 at 17:48
  • @Sycorax perfect explanation, thank you! Additionally, should the output layer be using sigmoid activation as opposed to softmax? On one hand sigmoid is the 'standard' for multi-label with BCE, however I feel softmax may be more suited since the sample labels will always sum to exactly 1. – Avelina Jun 29 '21 at 18:00
  • Both sum to 1. For a binary outcome, we can write $P(A) + P(A^c)=P(y=1)+P(y=0)=1$. For binary events, the difference in outputs between sigmoid and softmax is that a sigmoid output solely gives $P(A)=P(y=1)$, while a softmax output gives both $P(y=0)$ and $P(y=1)$. More broadly, you can show that for 2 classes, sigmoid is a special case of softmax. – Sycorax Jun 29 '21 at 19:24
  • @Sycorax yes I completely understand that for the 2 class case, however I have 100 classes, not just 2. – Avelina Jun 29 '21 at 21:20
  • Can you [edit] your post to clarify the two losses that you’re comparing? And are the classes mutually exclusive? – Sycorax Jun 29 '21 at 21:30
  • @Sycorax added additional information. – Avelina Jun 29 '21 at 21:51
  • The documentation says that keras.losses.BinaryCrossentropy is for the case of 2 classes ("Use this cross-entropy loss for binary (0 or 1) classification applications.") but you have 100. The documentation for keras.losses.CategoricalCrossentropy says "Use this crossentropy loss function when there are two or more label classes." Does this answer your question? – Sycorax Jun 29 '21 at 21:55
  • @Sycorax that's what it says, however it can be used with more than 2 classes. I looked at the source code and when there is more than one output logit it simply computes BCE for each logit and returns the mean (a small numeric check of this is sketched after these comments). There are also dozens of online tutorials which use BCE for multi-class multi-label classification in Keras. – Avelina Jun 29 '21 at 22:00
  • I'm surprised that the source code is doing that. I guess my question to you is "what is the negative log-likelihood that you want to minimize?" It's not necessarily the case that Keras will implement a loss for the likelihood that you care about. I could see a case for either one, or some third option, depending on how you're thinking about your data. – Sycorax Jun 29 '21 at 22:01
  • For instance, this concept is developed in the context of pixel intensities here https://stats.stackexchange.com/questions/206925/is-it-okay-to-use-cross-entropy-loss-function-with-soft-labels and here https://stats.stackexchange.com/questions/490062/can-we-derive-cross-entropy-formula-as-maximum-likelihood-estimation-for-soft-la – Sycorax Jun 30 '21 at 21:48
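
A quick numeric check of two points raised in the comments: that keras.losses.BinaryCrossentropy with several output logits averages the per-logit BCE terms, and that this average is minimised when the prediction equals the soft label itself. This is only a minimal sketch assuming TensorFlow/Keras; the toy vectors are illustrative.

    import numpy as np
    import tensorflow as tf

    lam = 0.3
    y_true = np.array([[lam, 1.0 - lam, 0.0, 0.0]], dtype="float32")  # 4 classes for brevity

    def manual_bce(y, p, eps=1e-7):
        """Mean over logits of the per-logit binary cross-entropy."""
        p = np.clip(p, eps, 1.0 - eps)
        return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

    y_pred = np.array([[0.25, 0.65, 0.05, 0.05]], dtype="float32")
    keras_bce = float(tf.keras.losses.BinaryCrossentropy()(y_true, y_pred))
    print(keras_bce, manual_bce(y_true, y_pred))  # the two values should agree closely

    # The per-logit BCE is minimised at p = y, so the loss at the soft label
    # itself is lower than at a nearby perturbed prediction.
    perturbed = np.array([[lam + 0.1, 1.0 - lam - 0.1, 0.0, 0.0]], dtype="float32")
    print(manual_bce(y_true, y_true) < manual_bce(y_true, perturbed))  # True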

0 Answers