
I am currently working on a multiclass classification problem where I have categorical variables that I've encoded using binary representations as follows:

0 -> 00
1 -> 01
2 -> 10
3 -> 11

My approach involves using a neural network with two output neurons, each having a sigmoid activation function. I use Mean Squared Error (MSE) as the loss function during training.

At inference time, for each neuron I assign 0 when its activation is below 0.5 and 1 otherwise.
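The encode/decode scheme described above can be sketched in plain Python (the function names here are illustrative, not from any particular library):

```python
def encode(label, k):
    # Binary code for a class label using k bits, e.g. encode(2, 2) -> [1, 0]
    return [(label >> i) & 1 for i in reversed(range(k))]

def decode(activations, threshold=0.5):
    # Threshold each sigmoid output, then read the resulting bits as an integer
    bits = [1 if a >= threshold else 0 for a in activations]
    label = 0
    for b in bits:
        label = (label << 1) | b
    return label
```

For example, `decode([0.8, 0.4])` thresholds to bits `[1, 0]` and recovers class 2.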

I am wondering if this approach is appropriate for a multiclass classification problem. The only advantage I can see is that it requires fewer weights between the second-to-last and last layers.

  • how many classes do you have? – gunes Jul 22 '23 at 12:55
  • Assume $2^k$ classes, $k \in \mathbb{N}$. – palash behra Jul 22 '23 at 12:58
  • why do you have two output neurons if you have 2^k classes? – gunes Jul 22 '23 at 13:57
  • assume k output neurons for 2^k classes – palash behra Jul 22 '23 at 14:10
  • Does one neuron consider if the category is dog vs cat and the other neuron consider if the photo is daytime or nighttime (so to speak)? Or do you have four mutually exclusive categories? If you are in the latter case, what would you do if you had a number of categories that isn’t equal to an integer power of two? – Dave Jul 22 '23 at 15:21
  • ceil(log2(num_classes)) number of neurons should be sufficient. Example encoding for 5 classes:

    000 -> 0
    001 -> 1
    010 -> 2
    011 -> 3
    100 -> 4

    – palash behra Jul 22 '23 at 16:41
  • What happens when all three neurons give probabilities exceeding $0.5$ (or whatever threshold you pick)? There is not a 111 category. – Dave Jul 22 '23 at 17:06
  • That should be considered an inferencing error, then. – palash behra Jul 22 '23 at 17:17
  • What do you mean by an “inferencing error”? I do not know that as either technical or colloquial terminology. – Dave Jul 22 '23 at 17:19
  • At training time, the error should contribute towards updating the parameters, and at testing, it will be just an error, as in an inaccurate prediction. – palash behra Jul 22 '23 at 17:35
  • Are you getting multiple output neurons with values above $0.5?$ There are ways to do it so that can happen, but the usual implementation with two output neurons is one where the sum of the outputs cannot exceed $1$. – Dave Jul 23 '23 at 20:55
  • Are you suggesting that I use SoftMax instead of sigmoid on last layer? – palash behra Jul 25 '23 at 05:38
  • So are you not ever getting multiple output neurons above $0.5?$ – Dave Jul 25 '23 at 11:55
  • I do, and that's how I am decoding the labels. – palash behra Jul 25 '23 at 14:52
  • Okay, then I guess you’re (correctly) approaching this as a multi-label problem. – Dave Jul 25 '23 at 15:09
  • This is a multi-class problem. Suppose the output layer for a forward pass is {0.8, 0.4, 0.9}. It gets hard-labelled to {1, 0, 1}, which finally gets decoded to 5. – palash behra Jul 25 '23 at 16:24
  • Those probabilities cannot happen for a multiclass problem, as their sum exceeds $1$. They can happen for a multi-label problem. While your overall problem is multiclass, your encoding turns the problem into a multi-label problem. – Dave Jul 25 '23 at 16:35

1 Answer


This could be clever. After all, more parameters means more potential to overfit, so if the parameter count can be reined in without sacrificing flexibility, this should be good news.

I see a few issues that are worth considering.

There is no natural interpretation of the explicit model outputs. You cannot interpret each neuron value as the probability of a particular outcome, and such probability values can be useful.

Consequently, you must rely on threshold-based classifications where being on the “wrong” side of a threshold results in a huge penalty. Similarly, there is no additional penalty for being extremely wrong as opposed to just slightly wrong.

If you go beyond four categories, you wind up with “empty” classifications. For instance, if you have five categories and encode them with three binary neurons, three of the eight possible codes will not correspond to any category. Sure, you can just regard those as mistakes when you are training and tuning the model, but it is not clear how to regard those when they are predicted by a deployed model.
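A quick sketch makes the “empty” codes concrete for the five-category case discussed above:

```python
import math

num_classes = 5
k = math.ceil(math.log2(num_classes))           # 3 neurons needed

all_codes = [format(i, f"0{k}b") for i in range(2 ** k)]
valid = all_codes[:num_classes]                  # 000..100 map to classes 0..4
empty = all_codes[num_classes:]                  # codes with no corresponding class
```

Here `empty` comes out as `['101', '110', '111']`: three outputs the network can produce that decode to no class at all.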

You’re minimizing square loss instead of using the multinomial (log) likelihood as the optimization criterion. We’re in a lucky situation where we know the form of the conditional distribution: multinomial on one roll of the die. This means that we can use maximum likelihood estimation with confidence and reap all of its benefits. Frank Harrell has gone so far as to describe fitting a model like this by minimizing square loss as “silly”. Thus, while such a model might slightly reduce the parameter count, if those parameters are not estimated as well as they otherwise would have been, model performance might not improve as much as desired, if at all.
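The standard maximum-likelihood alternative is a softmax output layer trained with cross-entropy (negative multinomial log-likelihood). A minimal stdlib-only sketch, not tied to any particular framework:

```python
import math

def softmax(logits):
    # Convert one neuron per class into a proper probability distribution
    m = max(logits)                               # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    # Negative log-likelihood of the true class under the multinomial model
    return -math.log(probs[true_class])
```

Because the softmax probabilities sum to $1$, at most one class can exceed $0.5$, and every output has a direct probabilistic interpretation — the two properties the binary-code encoding gives up.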

Dave