
I am reading this article: https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative about the derivative of the softmax function w.r.t. its input. Please confirm my understanding of the case when $i = j$:

\begin{equation} D_jS_i=\begin{cases} S_j(1-S_i), & \text{if } i = j,\\ -S_jS_i, & \text{otherwise}. \end{cases} \end{equation}

If $i = j$, then $S_j(1-S_i)$ is equal to $S_j - S_j^2$, or equivalently $S_i - S_i^2$, because if $i = j$ then $S_j = S_i$.
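
As a sanity check, here is a minimal numerical sketch (assuming NumPy; the example vector is my own) comparing the analytic diagonal entry $S_i - S_i^2$ with a central finite-difference estimate:

```python
import numpy as np

# Check that the diagonal entry of the softmax Jacobian equals S_i - S_i^2.
def softmax(z):
    e = np.exp(z - z.max())      # shift by max for numerical stability
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])   # arbitrary example input
s = softmax(z)
i, eps = 0, 1e-6

zp, zm = z.copy(), z.copy()
zp[i] += eps
zm[i] -= eps
numeric = (softmax(zp)[i] - softmax(zm)[i]) / (2 * eps)   # dS_i/dz_i
analytic = s[i] - s[i] ** 2                               # S_i - S_i^2

print(np.isclose(numeric, analytic))  # True
```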

artona

2 Answers


Ok, to begin with your last statement, I can't see the difficulty there. Yes, since $i = j$, then $S_i = S_j$. Therefore, $S_j(1 − S_i) = S_i − S_i^2 = S_j − S_j^2$.

If you are asking why the authors keep the distinct $i$ and $j$ indices in the first case, when it would be clearer to use just one, e.g. $S_i(1 − S_i)$, well... authors.

As to why we have these expressions, recall the definition of softmax:

$$S_i(\mathbf z) = \frac{e^{z_i}}{\sum_k{e^{z_k}}}$$

By taking the derivative with respect to the $j$-th entry of vector $\mathbf z$, we get:

$$\partial_jS_i(\mathbf z) = \frac{\sum_k{e^{z_k}}\times\partial_je^{z_i} - e^{z_i}\times\partial_j\sum_k{e^{z_k}}}{(\sum_k{e^{z_k}})^2}$$

Now, we have two cases. First, $i \neq j$:

$$\partial_jS_i(\mathbf z) = \frac{\sum_k{e^{z_k}}\times0 - e^{z_i}\times e^{z_j}}{(\sum_k{e^{z_k}})^2} = - \left(\frac{e^{z_i}}{\sum_k{e^{z_k}}}\right) \left(\frac{e^{z_j}}{\sum_k{e^{z_k}}}\right)$$

Which is equal to $-S_iS_j$.

For $i = j$, we get:

$$\partial_iS_i(\mathbf z) = \frac{\sum_k{e^{z_k}}\times e^{z_i} - e^{z_i}\times e^{z_i}}{(\sum_k{e^{z_k}})^2} = \left(\frac{e^{z_i}}{\sum_k{e^{z_k}}}\right) \left( \frac{\sum_k{e^{z_k}} - e^{z_i}}{\sum_k{e^{z_k}}} \right)$$

Which, after splitting off the factor $\frac{e^{z_i}}{\sum_k{e^{z_k}}} = S_i$ and noting that the remaining factor equals $1 - S_i$, becomes $S_i(1 − S_i)$.
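
Both cases can be verified numerically. Here is a minimal sketch (assuming NumPy; the input vector is arbitrary) that builds the Jacobian entry by entry from the two cases above and compares it against central finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                  # stable softmax
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
s = softmax(z)
n = len(z)

# Analytic Jacobian from the two cases: S_i(1 - S_j) if i == j, else -S_i S_j.
J = np.empty((n, n))
for i in range(n):
    for j in range(n):
        J[i, j] = s[i] * (1 - s[j]) if i == j else -s[i] * s[j]

# Central finite differences as an independent check.
eps = 1e-6
J_num = np.empty((n, n))
for j in range(n):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    J_num[:, j] = (softmax(zp) - softmax(zm)) / (2 * eps)

print(np.allclose(J, J_num))  # True
```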

  • To be honest, I had not encountered this expression of the softmax gradient/derivative w.r.t. the input before reading the cited article, so I had some doubts about this approach. But thanks to you it is now clear that my understanding was correct. Many thanks! – artona Aug 03 '18 at 16:30

When applicable, matrix/vector notation can be less cluttered than index notation.

Given the vectors $x$ and $s={\rm softmax}(x)$, the gradient of the latter is the matrix $$G= \frac{\partial s}{\partial x} = {\rm Diag}(s)-ss^T$$
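
In code, this is just a diagonal matrix minus an outer product. A minimal NumPy sketch (the input vector is arbitrary):

```python
import numpy as np

# G = Diag(s) - s s^T, the Jacobian of softmax in matrix form.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1.0, 2.0, 0.5])
s = softmax(x)

G = np.diag(s) - np.outer(s, s)   # Jacobian ds/dx

# The entries reproduce the index formula:
#   G[i, i] = s_i (1 - s_i)   and   G[i, j] = -s_i s_j  for i != j
print(np.allclose(np.diag(G), s * (1 - s)))  # True
```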

greg