
I'm trying to understand how to compute the derivative of the Softmax activation function in order to compute the gradient of the quadratic cost function w.r.t. the weights of the last layer.

Considering the cost function as: $$C(W,B) = (A^{(L)} - Y)^2$$

Then $$Z^{(L)} = W^{(L)}A^{(L-1)} + B^{(L)}$$

$$A^{(L)} = \operatorname{softmax}(Z^{(L)})$$
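For concreteness, here is a minimal NumPy sketch of the softmax I have in mind (the max subtraction is only for numerical stability and does not change the result):

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the output is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
a = softmax(z)
print(a, a.sum())  # entries in (0, 1) that sum to 1
```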

With the chain rule we have that:

$\frac{\partial C}{\partial W^{(L)}}=\frac{\partial C}{\partial A^{(L)}}\frac{\partial A^{(L)}}{\partial Z^{(L)}}\frac{\partial Z^{(L)}}{\partial W^{(L)}}=2(A^{(L)} - Y)\frac{\partial A^{(L)}}{\partial Z^{(L)}}A^{(L-1)}$

And $\delta^{(L)}=\frac{\partial C}{\partial A^{(L)}}\frac{\partial A^{(L)}}{\partial Z^{(L)}}$ is useful to compute the delta of the previous layer.

The main problem is that I don't know how to compute the middle term $\frac{\partial A^{(L)}}{\partial Z^{(L)}}$ when the activation function is the softmax.
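Since I can't derive that term yet, here is a rough NumPy sketch (central finite differences) that I can at least use to approximate $\frac{\partial A^{(L)}}{\partial Z^{(L)}}$ numerically and check any candidate formula against:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def numerical_jacobian(f, z, eps=1e-6):
    # Central differences: column j approximates d f(z) / d z_j.
    z = np.asarray(z, dtype=float)
    n = z.size
    J = np.zeros((n, n))
    for j in range(n):
        dz = np.zeros(n)
        dz[j] = eps
        J[:, j] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return J

z = np.array([0.5, -1.0, 2.0])
print(numerical_jacobian(softmax, z))  # an n x n matrix, not a single vector
```

One thing this makes clear is that the term is a full Jacobian matrix: every output $A_i^{(L)}$ depends on every input $Z_j^{(L)}$.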

Most of the time I've seen that

$\delta^{(L)}=A^{(L)}-Y$

Where does this simple equation come from?
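Not a derivation, but here is a quick autodiff sanity check of that claim (a PyTorch sketch; note it uses the cross-entropy loss that this identity is usually quoted for, not my quadratic cost):

```python
import torch

torch.manual_seed(0)

z = torch.randn(5, requires_grad=True)  # pre-activations Z^(L)
y = torch.zeros(5)
y[2] = 1.0                              # one-hot target Y

a = torch.softmax(z, dim=0)             # A^(L) = softmax(Z^(L))
loss = -(y * torch.log(a)).sum()        # cross-entropy: -sum_i Y_i log A_i^(L)
loss.backward()

print(z.grad)          # dC/dZ^(L) computed by autodiff
print(a.detach() - y)  # A^(L) - Y, which matches the gradient above
```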

I'm very new to Backpropagation and I think I'm missing something...

  • 1) It's rather strange to use MSE together with a softmax (one would usually want a cross-entropy loss). 2) The softmax function has a wonderfully simple derivative: $\frac{\partial \mathbf{A}}{\partial \mathbf{Z}} = \sigma(\mathbf{Z})(1-\sigma(\mathbf{Z}))$, where this multiplication is to be understood elementwise (and $\sigma$ represents the softmax). 3) But why calculate gradients yourself? We have autodiff :)
  • – John Madden Jul 18 '23 at 14:40
  • MSE because I thought it was easier to understand and handle. I would like to understand how the derivative is calculated. I am doing all this 'manually' because I would like to know exactly what happens 'under the bonnet'.
  • – Dario Ranieri Jul 18 '23 at 14:47
  • 1) Notice that the cost function changes only the gradient contribution of the output layer, so switching out cost functions only causes small changes to our overall calculations. 2) For a derivation of the result I shared, see (https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x). 3) In my humble opinion, seeing how a particular gradient is calculated does not help understanding of how neural nets are trained nearly as much as learning about how the optimizer is going to make use of that gradient.
  • – John Madden Jul 18 '23 at 14:52