I'm trying to understand how to compute the derivative of the Softmax activation function in order to compute the gradient of the quadratic cost function w.r.t. the weights of the last layer.
Consider the quadratic cost function
$$C(W,B) = \lVert A^{(L)} - Y \rVert^2 = \sum_j \left(a_j^{(L)} - y_j\right)^2$$
with the last layer given by
$$Z^{(L)} = W^{(L)}A^{(L-1)} + B^{(L)}$$
$$A^{(L)} = \mathrm{softmax}(Z^{(L)}), \qquad a_i^{(L)} = \frac{e^{z_i^{(L)}}}{\sum_k e^{z_k^{(L)}}}$$
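For concreteness, this is how I picture the forward pass in NumPy (just a minimal sketch with made-up shapes; the names `a_prev`, `W`, `b`, `y` are my own):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def quadratic_cost(a, y):
    # C = ||a - y||^2 for a single sample
    return np.sum((a - y) ** 2)

rng = np.random.default_rng(0)
a_prev = rng.random(4)            # A^(L-1): 4 units in the previous layer
W = rng.standard_normal((3, 4))   # W^(L): 3 output units
b = rng.standard_normal(3)        # B^(L)
y = np.array([0.0, 1.0, 0.0])     # one-hot target

z = W @ a_prev + b                # Z^(L)
a = softmax(z)                    # A^(L)
cost = quadratic_cost(a, y)       # C
```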
With the chain rule we have that:
$$\frac{\partial C}{\partial W^{(L)}}=\frac{\partial C}{\partial A^{(L)}}\frac{\partial A^{(L)}}{\partial Z^{(L)}}\frac{\partial Z^{(L)}}{\partial W^{(L)}}=2\left(A^{(L)} - Y\right)\frac{\partial A^{(L)}}{\partial Z^{(L)}}A^{(L-1)}$$
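My plan is to check whatever expression I end up with against a finite-difference estimate of $\frac{\partial C}{\partial W^{(L)}}$, something like the following (continuing the sketch above, so it reuses `softmax`, `quadratic_cost`, `W`, `b`, `a_prev` and `y`):

```python
def cost_of_W(W, a_prev, b, y):
    z = W @ a_prev + b
    return quadratic_cost(softmax(z), y)

eps = 1e-6
dC_dW = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus = W.copy();  W_plus[i, j] += eps
        W_minus = W.copy(); W_minus[i, j] -= eps
        # central-difference approximation of dC / dW[i, j]
        dC_dW[i, j] = (cost_of_W(W_plus, a_prev, b, y)
                       - cost_of_W(W_minus, a_prev, b, y)) / (2 * eps)
```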
And the term $\delta^{(L)}=\frac{\partial C}{\partial A^{(L)}}\frac{\partial A^{(L)}}{\partial Z^{(L)}}$ is what I then use to compute the delta of the previous layer.
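(For that step I have in mind the usual backpropagation recursion, assuming layer $L-1$ uses an elementwise activation $\sigma$:)
$$\delta^{(L-1)} = \left(\left(W^{(L)}\right)^{T}\delta^{(L)}\right)\odot\sigma'\!\left(Z^{(L-1)}\right)$$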
The main problem is that I don't know how to compute that middle term $\frac{\partial A^{(L)}}{\partial Z^{(L)}}$ when the activation is Softmax: each output $a_i^{(L)}$ depends on every $z_j^{(L)}$, so this term is not a simple elementwise derivative.
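So far the best I can do is estimate that term numerically as a matrix of partial derivatives $\frac{\partial a_i^{(L)}}{\partial z_j^{(L)}}$ (a rough sketch, reusing the `softmax` above on an arbitrary $z$):

```python
z = np.array([0.5, -1.0, 2.0])
eps = 1e-6
n = z.size
jac = np.zeros((n, n))   # jac[i, j] ~ d a_i / d z_j
for j in range(n):
    z_plus = z.copy();  z_plus[j] += eps
    z_minus = z.copy(); z_minus[j] -= eps
    jac[:, j] = (softmax(z_plus) - softmax(z_minus)) / (2 * eps)
```

But I don't see how to turn this into a closed-form expression.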
However, most of the sources I've seen simply state that
$$\delta^{(L)}=A^{(L)}-Y$$
Where does this simple equation come from?
I'm very new to backpropagation, so I think I'm missing something...