Hastie et al.'s "The Elements of Statistical Learning" textbook defines the probabilistic model of multiclass logistic regression with $K$ classes as $\forall k \in \{1, \dots, K-1\}$ $$ \ln \frac{p(G=k \mid X=x)}{p(G=K \mid X=x)} = w_k^T x + \beta_k,$$ where each $w_k$ is a weight vector and the random variable $G$ is the class to which the observation $X$ belongs. This model is then fit by the maximum likelihood principle (equivalently, by maximizing the log likelihood). AFAIK this is essentially the same as minimizing a cross-entropy loss. As you can see, we fit $K-1$ linear models, with class $K$ serving as the reference class.
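To spell out how I read that model: with class $K$ as the reference (its logit fixed at $0$), the $K-1$ linear scores already determine all $K$ class probabilities. A minimal NumPy sketch of that reading (function name and shapes are my own, not from the book):

```python
import numpy as np

def probs_from_reference_logits(x, W, b):
    """Class probabilities under the K-1 parameterization.

    W: (K-1, d) weights (one row w_k per non-reference class),
    b: (K-1,) intercepts beta_k; class K is the reference with logit 0.
    """
    z = W @ x + b                # K-1 log-odds against class K
    z = np.append(z, 0.0)        # implicit logit of the reference class K
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()           # p(G=k | X=x) for k = 1..K
```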
However, many deep learning classification tutorials fit a neural network with $K$ outputs, which are fed into a softmax function and then into a cross-entropy loss. This is less computationally efficient than fitting a network with $K-1$ outputs.
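For reference, this is the kind of setup I mean, roughly as the tutorials write it (a PyTorch sketch; the layer sizes and batch are just placeholders):

```python
import torch
import torch.nn as nn

K, d = 5, 10
model = nn.Linear(d, K)              # K outputs: one logit per class
loss_fn = nn.CrossEntropyLoss()      # applies log-softmax + NLL internally

x = torch.randn(32, d)               # dummy batch of 32 inputs
y = torch.randint(0, K, (32,))       # integer labels in {0, ..., K-1}

loss = loss_fn(model(x), y)          # softmax cross entropy over K logits
loss.backward()
```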
How does this type of logistic regression differ from fitting $K$ linear models with a cross-entropy loss? Can I apply the same probabilistic model (and use $K-1$ outputs) with neural networks? Why do the tutorials use $K$ outputs?
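To make the second question concrete, this is how I imagine the $K-1$-output variant would be wired up in a neural network: predict $K-1$ logits, append a fixed zero logit for the reference class, and only then apply softmax + cross entropy (my own guess at the wiring, not taken from any tutorial):

```python
import torch
import torch.nn.functional as F

K, d = 5, 10
model = torch.nn.Linear(d, K - 1)        # K-1 outputs; last class is the reference

x = torch.randn(32, d)
y = torch.randint(0, K, (32,))

z = model(x)                             # (32, K-1) free logits
zero = torch.zeros(z.size(0), 1)         # fixed 0 logit for the reference class
logits = torch.cat([z, zero], dim=1)     # (32, K), same model as the ESL one
loss = F.cross_entropy(logits, y)
loss.backward()
```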
A little update: perhaps the $K-1$-output variant is what people call the sigmoid loss, and the $K$-output variant is what people call the cross-entropy loss? And they are actually different?
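(For what it's worth, in the binary case $K=2$ the single free logit run through a softmax against a zero reference logit is exactly a sigmoid, which is what makes me suspect the two are related; a quick NumPy check with an arbitrary logit value:)

```python
import numpy as np

z = 1.7                                      # the single (K-1 = 1) logit
pair = np.exp([z, 0.0]) / np.exp([z, 0.0]).sum()
print(pair[0])                               # softmax([z, 0])[0]
print(1.0 / (1.0 + np.exp(-z)))              # sigmoid(z): identical value
```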