I'll assume you understand backpropagation with a sigmoidal activation function. If not, please tell me and I'll edit this answer.
First, let's look at both the sigmoid and the softmax functions for a given output $ x_i $, where $ x $ is the vector containing all the outputs of the layer.
The sigmoid is: $ f_i(x) = \frac{1}{1 + e^{-x_i}} $, while the softmax is $ g_i(x) = \frac{e^{x_i}}{\sum_{j=1}^N e^{x_j}} $
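As a side note, here is a minimal NumPy sketch of both functions (the function names are mine, and subtracting the maximum inside the softmax is just a common numerical-stability trick; it does not change the result):

```python
import numpy as np

def sigmoid(x):
    # Element-wise sigmoid: f_i(x) = 1 / (1 + exp(-x_i))
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Softmax over the whole vector: g_i(x) = exp(x_i) / sum_j exp(x_j)
    # Subtracting max(x) keeps the exponentials from overflowing
    # without changing the ratios.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
print(sigmoid(x))        # each entry is treated independently
print(softmax(x))        # entries depend on each other and sum to 1
print(softmax(x).sum())  # 1.0
```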
Then the sigmoid derivative is $ \frac{\partial f_i(x)}{\partial x_j} = \left\{ \begin{matrix} f_i(x) (1 - f_i(x)) & i = j \\ 0 & i \neq j \end{matrix} \right. $
while the softmax derivative is $ \frac{\partial g_i(x)}{\partial x_j} = \left\{ \begin{matrix} \frac{e^{x_i} \left( \sum_{k=1, k \neq i}^N e^{x_k} \right)}{\left(\sum_{k=1}^N e^{x_k} \right)^2} = g_i(x)\,(1 - g_i(x)) & i = j \\ - \frac{e^{x_i + x_j}}{\left(\sum_{k=1}^N e^{x_k} \right)^2} = -g_i(x)\,g_j(x) & i \neq j \end{matrix} \right. $
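Both cases collapse into the single expression $ \frac{\partial g_i(x)}{\partial x_j} = g_i(x)\,(\delta_{ij} - g_j(x)) $, i.e. the Jacobian is $ \operatorname{diag}(g) - g g^T $. Here is a small sketch (helper names are mine) that builds the Jacobian this way and checks it against finite differences:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    # J[i, j] = dg_i/dx_j = g_i * (delta_ij - g_j)
    g = softmax(x)
    return np.diag(g) - np.outer(g, g)

x = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(x)

# Central finite-difference check of the analytic Jacobian.
eps = 1e-6
J_num = np.zeros_like(J)
for j in range(len(x)):
    d = np.zeros_like(x)
    d[j] = eps
    J_num[:, j] = (softmax(x + d) - softmax(x - d)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-6))  # True
```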
As you can see, for $ i = j $ we have $ \frac{\partial g_i(x)}{\partial x_i} > 0 $, so the weights related to that output receive a positive reinforcement, just as with the sigmoid.
But for all the other weights ($ i \neq j $) we have $ \frac{\partial g_i(x)}{\partial x_j} < 0 $, so they receive a negative reinforcement, unlike with the sigmoid, whose cross-derivatives are zero and therefore leave them unchanged.
Depending on your problem, this may make convergence faster, at a higher computational expense (computing the softmax is slower than computing the sigmoid). A purely illustrative demo of this reinforcement pattern is sketched below.
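In the sketch, the logits stand in for the weights feeding each output, and the step sizes and values are arbitrary: nudging the logits in the direction that increases output $ i = 0 $ pushes every competing logit down with softmax, while the sigmoid leaves them alone.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([0.2, 0.1, -0.3])
lr, i = 0.5, 0  # gradient-ascent step on output i

# Softmax: dg_i/dx_j = g_i * (delta_ij - g_j), positive at j = i,
# negative everywhere else.
g = softmax(x)
grad_soft = g[i] * ((np.arange(len(x)) == i) - g)

# Sigmoid: df_i/dx_j is nonzero only at j = i.
s = sigmoid(x)
grad_sig = np.zeros_like(x)
grad_sig[i] = s[i] * (1.0 - s[i])

print(grad_soft)           # [+, -, -]
print(grad_sig)            # [+, 0, 0]
print(x + lr * grad_soft)  # competing logits are pushed down
print(x + lr * grad_sig)   # competing logits stay put
```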
Finally, the predicted label is the one with the maximum output value (thus the name softmax). With that in mind, the output vector $ o $ can be expressed as
$ o_i = \left\{ \begin{matrix} 1\;, & \; i = \text{argmax}[g(x)] \\
0\;, & \; \text{otherwise} \end{matrix} \right. $
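A minimal sketch of that prediction step, reusing the softmax helper from above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, 3.0, 0.5])
g = softmax(x)

# One-hot output vector o: 1 at the argmax of g(x), 0 elsewhere.
o = np.zeros_like(g)
o[np.argmax(g)] = 1.0
print(o)  # [0. 1. 0.]
```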