Confused with binary cross-entropy vs categorical cross-entropy

Question

I have a dataset with 10 categorical input features and one categorical output feature with classes 0 and 1. X_train is a 3D array, and I label-encoded the dataset beforehand. With categorical_crossentropy and a sigmoid activation function I get 26% accuracy. When I switch to binary_crossentropy, the accuracy increases drastically to 98%.

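from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_timesteps, n_features, n_outputs = 5, 10, 1  # taken from the shapes of X_train and y_train given below

model = Sequential()
model.add(LSTM(256, input_shape=(n_timesteps, n_features), recurrent_activation='hard_sigmoid'))
model.add(Dense(16))
model.add(Dense(n_outputs, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])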

and the dataset is divided as:

X_train:  (430000, 5, 10)
y_train:  (430000, 1)

What is the value of n_outputs? Note that there are circumstances when the two losses are equivalent, but it's not clear that those circumstances exist in your code; see https://stats.stackexchange.com/q/260505/22311 — Sycorax, Feb 27 '22 at 14:19
n_outputs is 1, and the model is predicting only class 0, never class 1. Please suggest how I can improve it. — be_real, Feb 28 '22 at 09:33

Sycorax · Accepted Answer · 2022-02-28T14:38:57.417

There are circumstances when the two losses are equivalent, but those circumstances do not exist in OP's code.

In a comment, OP writes that they only have one output neuron.

With 1 output neuron and binary cross-entropy, the model outputs a single value $p$ and the loss for one example is computed as

$$ L_b = -y \log p - (1 - y) \log (1 - p), $$ which is the correct way to compute the loss.

However, with 1 output neuron and categorical cross-entropy, the loss is computed as

$$ L_c = -y \log p $$

which is clearly different because it fixes $(1-y) \log(1-p)=0$. This loss is obviously bogus because it is minimized at $L_c = 0$ by setting $p=1$ regardless of the input, resulting in a totally useless model.
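To see the difference numerically, here is a minimal NumPy sketch of the two single-output formulas above. The function names and the grid of $p$ values are illustrative choices, not part of OP's code, and the sketch evaluates the formulas as written here rather than calling Keras.

```python
import numpy as np

def binary_ce(y, p):
    """Binary cross-entropy L_b for one example with a single output p."""
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def single_output_categorical_ce(y, p):
    """Single-output categorical cross-entropy L_c: only the -y*log(p) term."""
    return -y * np.log(p)

# Hypothetical grid of predicted probabilities, kept away from 0 and 1.
p_grid = np.linspace(0.01, 0.99, 99)

# Negative example (y = 0): L_b penalises large p, while L_c ignores p entirely.
lb_neg = binary_ce(0, p_grid)
lc_neg = single_output_categorical_ce(0, p_grid)

# Positive example (y = 1): L_c keeps decreasing as p approaches 1.
lc_pos = single_output_categorical_ce(1, p_grid)

print("L_c is zero for every p when y = 0:", np.allclose(lc_neg, 0.0))
print("L_b grows with p when y = 0:", lb_neg[0] < lb_neg[-1])
print("p on the grid that minimises L_c when y = 1:", p_grid[np.argmin(lc_pos)])
```

Because $L_c$ never penalises a large $p$ on negative examples, nothing in this loss pushes the output away from $p = 1$.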

To use categorical cross-entropy correctly, OP needs to make these changes:

1. Use $k$ output neurons (one for each of the $k$ classes). In OP's particular case, $k=2$.
2. Make these output neurons a probability vector: the neurons sum to 1 for all inputs, and all values are non-negative. The standard way to do this is to use a softmax activation in the output layer.
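The two requirements above can be sketched in plain NumPy. The raw scores and labels below are made-up values for illustration, and the helper names are hypothetical. In Keras this would presumably correspond to a final `Dense(2, activation='softmax')` layer with one-hot targets, but that mapping is an assumption about OP's setup and is not exercised here.

```python
import numpy as np

def softmax(z):
    """Map raw scores to a probability vector along the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def one_hot(y, k):
    """Encode integer labels 0..k-1 as rows of a k-column indicator matrix."""
    return np.eye(k)[y]

def categorical_ce(y_onehot, probs):
    """Per-example categorical cross-entropy over k classes."""
    return -np.sum(y_onehot * np.log(probs), axis=-1)

# Hypothetical raw outputs of k = 2 output neurons for three examples.
logits = np.array([[2.0, -1.0], [0.0, 0.0], [-3.0, 1.5]])
labels = np.array([0, 1, 1])

probs = softmax(logits)
loss = categorical_ce(one_hot(labels, 2), probs)

print("row sums:", probs.sum(axis=-1))
print("per-example loss:", loss)
```

With two softmax outputs, raising the probability of one class necessarily lowers the other, so the loss can no longer be driven to zero by a constant prediction.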

After making these changes, the loss will be computed correctly when using categorical cross-entropy. This is because the model now outputs two values $p_1, p_2$, so the loss is

$$ L_c = -y \log p_1 - (1 - y) \log p_2 $$

where $ 0 \le p_i \le 1$ and $p_1 + p_2 = 1$. In this setting, it's simple algebra to show that $L_c = L_b$, as desired.
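The algebra can be spelled out in one line. Identifying $p_1$ with the single-output probability $p$ (an identification assumed here to match the notation of $L_b$), the constraint $p_1 + p_2 = 1$ gives $p_2 = 1 - p$, and therefore

$$ L_c = -y \log p_1 - (1 - y) \log p_2 = -y \log p - (1 - y) \log (1 - p) = L_b. $$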

thank you for the explanation!! — be_real, Mar 01 '22 at 14:22
