Binary numbers instead of one hot vectors

Question

When doing logistic regression, it is common practice to use one-hot vectors as the desired output, so the number of nodes in the output layer equals the number of classes. We don't use the index of the word in the vocabulary as the target, because that may falsely suggest that two classes are close to each other. But why can't we use binary numbers instead of one-hot vectors?

I.e., if there are 4 classes, we can represent the classes as 00, 01, 10 and 11, resulting in log2(number of classes) nodes in the output layer (rounded up when the number of classes is not a power of two).
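To make the two target encodings concrete, here is a minimal sketch in plain Python. The helper names `one_hot` and `binary_code` are made up for illustration and do not come from any library.

```python
def one_hot(index, num_classes):
    """Return a one-hot target: all zeros except a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(num_classes)]


def binary_code(index, num_classes):
    """Return the fixed-width binary code of `index`, most significant bit first."""
    # Number of bits needed to write the largest index (num_classes - 1).
    width = max(1, (num_classes - 1).bit_length())
    return [(index >> shift) & 1 for shift in range(width - 1, -1, -1)]


num_classes = 4
one_hot_targets = [one_hot(i, num_classes) for i in range(num_classes)]
binary_targets = [binary_code(i, num_classes) for i in range(num_classes)]

for i in range(num_classes):
    print(i, one_hot_targets[i], binary_targets[i])
```

With 4 classes the one-hot targets need 4 output nodes and the binary targets need 2; with 1000 classes the binary code would need 10 nodes instead of 1000.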

@Carl one hot vector = all elements 0 except one element that is 1. E.g. 000000100 — Franck Dernoncourt, Oct 24 '16 at 13:31

score 4 · Accepted Answer · answered Oct 23 '16 at 21:52

Using binary numbers instead of one-hot vectors introduces dependencies between the different classes.

For example, if

number: class name
00: blue
01: red
10: black
11: green

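then the first bit must be 0 whenever the class is blue or red, so a single output node has to learn something that blue and red supposedly share, even if the two classes have nothing meaningful in common. This may confuse the classifier.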
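One way to see this dependency is to count how many output positions differ between each pair of class codes (the Hamming distance). The sketch below uses the color table above; the dictionary names and the one-hot assignment are illustrative choices, not part of the original answer.

```python
from itertools import combinations

# Binary codes from the table above, plus one possible one-hot assignment.
binary_codes = {"blue": "00", "red": "01", "black": "10", "green": "11"}
one_hot_codes = {"blue": "1000", "red": "0100", "black": "0010", "green": "0001"}


def hamming(a, b):
    """Count the positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(codes):
    """Map each unordered pair of class names to the distance between their codes."""
    return {(a, b): hamming(codes[a], codes[b]) for a, b in combinations(codes, 2)}


binary_distances = pairwise_distances(binary_codes)
one_hot_distances = pairwise_distances(one_hot_codes)

for pair in binary_distances:
    print(pair, "binary:", binary_distances[pair], "one-hot:", one_hot_distances[pair])
```

Under one-hot every pair of classes is equally far apart, whereas the binary code places blue closer to red and black than to green, purely as a side effect of how the numbers were assigned.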

However, the approach can sometimes be useful: see, e.g., the hierarchical output layer (hierarchical softmax), where the choice of the tree can significantly impact the performance of the network.

Note, however, that {1} report the following:

The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation.

so you may want to explore methods other than hierarchical softmax when doing speed optimization.

{1} Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In Advances in neural information processing systems, pp. 3111-3119. 2013. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf ; https://scholar.google.com/scholar?cluster=2410615501856807729&hl=en&as_sdt=0,22

score 1 · Answer 2 · answered Jul 30 '17 at 15:34

As mentioned by Franck, there are definitely tradeoffs. When the categorical variable has a very large number of categories, one-hot encoding can blow up the size of the dataset, which isn't great for some classifiers and datasets either. In those cases, pragmatically, trading off some encoding quality for a smaller dataset can make sense.

If you want to experiment with it, you can use the BaseN encoder in https://github.com/scikit-learn-contrib/categorical-encoding. It lets you specify a base (base-1 is taken to be equivalent to one-hot, base-2 is the binary case, etc.), so you can run a grid search or something similar to find a base that works well. But if you can get away with it, one-hot will always represent the categories more faithfully.
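To illustrate the idea of a base-N encoding, here is a rough sketch in plain Python. It is not the library's implementation: `base_n_encode` is a hypothetical helper, and the real encoder may number categories and lay out columns differently. Base 1 is special-cased as one-hot to mirror the convention described above.

```python
def base_n_encode(index, num_categories, base):
    """Encode a category index as a fixed-width list of digits in the given base.

    Hypothetical helper for illustration only; base 1 is treated as one-hot.
    """
    if base == 1:
        return [1 if i == index else 0 for i in range(num_categories)]
    # Smallest width such that base**width covers every category index.
    width = 1
    while base ** width < num_categories:
        width += 1
    digits = []
    for _ in range(width):
        digits.append(index % base)
        index //= base
    return digits[::-1]  # most significant digit first


# Number of columns needed for 1000 categories at a few bases.
widths = {base: len(base_n_encode(0, 1000, base)) for base in (1, 2, 4, 10)}
print(widths)
print(base_n_encode(5, 8, 2))
```

A larger base gives fewer columns, but each column then carries several possible values and more categories end up sharing digits, which is the quality-for-size tradeoff discussed above.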
