Rather than using cross-entropy loss and one-hot encoding for neural network classification tasks, what will happen if my model outputs a single scalar value and I use mean squared error loss?
-
What do you mean that the model outputs a single scalar? What is the alternative? – Dave Feb 21 '24 at 03:05
-
Please explain whether (a) you have only two classes coded 0 and 1, and your output is a number between 0 and 1 that can be interpreted as a probability (in which case the MSE is the Brier score, and cross entropy is the log loss, and your question is a duplicate of this one), or (b) you have $n$ classes encoded $1$ to $n$, and your output is a number between $1$ and $n$ (in which case Zurienu's answer applies). – Stephan Kolassa Feb 21 '24 at 07:14
1 Answer
My answer to this is: it's terrible in most situations, and I wouldn't recommend it, except if the classes are values on an interval or ratio scale, or you have only two classes.
First, I'll clarify some definitions and the answer will follow.
One-hot encoding is basically a function that takes a class index, say $i \in \{1, 2, 3, \ldots, k\}$, with $k$ being the number of different classes in your classification task, and returns a vector of 0s everywhere except in the $i$-th position, where it's a 1. The goal is to encode class membership as a positional bit vector instead of as a single integer. However, this vector can also be seen as a discrete probability distribution with all the probability mass on the class the datum belongs to. I'll name this function $\mathcal{O}$, for One-hot encoding. So if $k=3$, then $\mathcal{O}(2) = [~0\quad 1\quad 0~]$.
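For concreteness, here is a minimal NumPy sketch of such a function (the name `one_hot` and the 1-based indexing are just illustrative choices):

```python
import numpy as np

def one_hot(i: int, k: int) -> np.ndarray:
    """Return a length-k vector of 0s with a 1 in the i-th (1-based) position."""
    v = np.zeros(k)
    v[i - 1] = 1.0
    return v

print(one_hot(2, 3))  # [0. 1. 0.]
```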
On the other hand, cross-entropy is a function derived from a divergence (similar to, but distinct from, a distance function), and it scores how "unlikely" data drawn from distribution $P$ would appear under distribution $Q$. So if $H$ is the cross-entropy function: $$H(P, Q) := -\sum_{x} P(x) \log Q(x).$$
In neural network (NN) classification tasks, you would use the cross-entropy function to penalize the output of your NN, namely a vector of size $k$. Say your NN is a function $\mathcal{N} : \mathbb{R}^m \to (0,1)^k$, with $m$ being the dimension/size of the input. Note that the output lies in $(0,1)^k$, the space of all vectors whose entries are strictly between 0 and 1. You can obtain such an output with a softmax layer, for example.
Let's also suppose that we write a data pair as $(x_i, {\rm class}_j)$, where $x_i$ is the $i$-th input datum and ${\rm class}_j$ is its associated class. The cross-entropy then compares $\mathcal{O}({\rm class}_j)$ with $\mathcal{N}(x_i)$, i.e. it compares the probability distribution expressing "certainty that it is class $j$" with what can be viewed as the NN's estimate of the probability distribution of $x_i$ over the classes.
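As a small numerical illustration (the logits here are made up), the cross-entropy between a one-hot target and a softmax output reduces to the negative log of the probability the network assigns to the true class:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Map raw network outputs (logits) to a probability vector in (0,1)^k."""
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """H(P, Q) = -sum_x P(x) * log(Q(x))."""
    return float(-np.sum(p * np.log(q)))

target = np.array([0.0, 1.0, 0.0])            # O(2) with k = 3
output = softmax(np.array([0.5, 2.0, -1.0]))  # hypothetical NN logits
print(cross_entropy(target, output))          # equals -log(output[1])
```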
(And this approach is so intuitive that it is actually equivalent to an estimation paradigm from statistics called Maximum Likelihood Estimation.)
But what would happen if, instead of comparing a vector output with the one-hot encoding, you let your neural network output a single real number and compared it with the class index $i \in \{1, 2, 3, \ldots, k\}$? Well, if the output can be anywhere between $-\infty$ and $\infty$, it makes little sense to compare, for example, 839.1 to the class numbers. If $k = 14$ and the true class is 12, the squared error is $(12-839.1)^2 = 684{,}094.41$. It's certainly useful to minimize this value, so that the output of the NN ends up near the class integers, but then your NN can still output values outside the set of class integers. You need a rule to map the value to a class, since we're doing classification.
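To make that concrete, here is a minimal sketch of such a setup (the raw output value and the rounding rule are purely illustrative):

```python
import numpy as np

k = 14                                   # number of classes
true_class = 12
raw_output = 839.1                       # unconstrained scalar output of the network

mse = (true_class - raw_output) ** 2
print(mse)                               # ~684094.41

# The scalar output still has to be mapped back to a class somehow,
# e.g. by rounding and clipping to the valid index range -- an ad hoc rule.
predicted_class = int(np.clip(round(raw_output), 1, k))
print(predicted_class)                   # 14
```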
If instead the output is constrained to the set of indices ($\mathcal{N} : \mathbb{R}^m \to \{1, 2, 3, \ldots, k\}$), then comparing the actual class with the output would end up comparing pairs like $\mathcal{N}(x_j)=3$ and ${\rm class}_j = 8$, thus penalizing the squared distance between numbers which... don't necessarily have any numerical meaning. Does the gap between class 3 and class 7 have the same numerical meaning as the gap between class 1 and class 5? If the answer is yes, then you aren't really doing classification (or, more aptly, you are using a classification method on a regression problem disguised as a classification problem). That is because your labels live on an interval or ratio scale, as I mentioned in the summarized answer.
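Conversely, when the classes are purely nominal, the squared distance depends only on the integer codes you happened to assign (the class names and codes below are hypothetical):

```python
labels = {"cat": 1, "dog": 2, "horse": 3, "bird": 8}

# Predicting "horse" when the truth is "bird" is penalized 25 times more than
# predicting "dog" when the truth is "cat", purely because of the arbitrary
# integer coding -- the class names themselves carry no such ordering.
print((labels["bird"] - labels["horse"]) ** 2)  # 25
print((labels["dog"] - labels["cat"]) ** 2)     # 1
```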
So that's the first exception to using mean squared error (MSE) on a scalar output for a classification problem.
The other exception is if the number of classes is $k=2$, but I would suggest having your NN output a value between 0 and 1 instead of an unbounded real number (for stability, mostly). I can explain this suggestion in more detail if anyone asks.
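A small sketch of what I mean (squashing the scalar output through a sigmoid so that MSE becomes the Brier score; the numbers are made up):

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

y_true = 1                    # binary label in {0, 1}
raw_output = 0.8              # unconstrained scalar output of the network
p = sigmoid(raw_output)       # squashed into (0, 1), interpretable as P(class = 1)

brier = (y_true - p) ** 2                                        # MSE on a probability = Brier score
log_loss = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))  # binary cross-entropy
print(brier, log_loss)
```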