
Given a $K$-dimensional real-valued vector $\mathbf{z} = (z_1, z_2, \ldots, z_K)$, I know that the softmax function returns a vector $\sigma(\mathbf{z})$ with positive elements summing to 1 via the following formula:

$$ \sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}, \quad j = 1, \ldots, K $$

Recently a colleague mentioned to me a variant of the softmax function, which I'll call $\sigma'(\cdot)$, that takes the following form:

$$ \sigma'(\mathbf{z})_j = \frac{\alpha^{r(\mathbf{z}, j)}}{\sum_{k=1}^K \alpha^{r(\mathbf{z}, k)}}, \quad j = 1, \ldots, K $$

Here, $\alpha > 0$ is a constant and $r(\mathbf{z}, j)$ is the rank of $z_j$ within the vector $\mathbf{z}$: the smallest value takes rank 1, the second smallest rank 2, and so on, with the largest value taking rank $K$.
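
For concreteness, here is a minimal NumPy sketch of both functions as I understand them (the names `softmax` and `rank_softmax` are mine, and ties in $\mathbf{z}$ are not handled, since the definition above assumes distinct values):

```python
import numpy as np

def softmax(z):
    """Standard softmax; subtract max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rank_softmax(z, alpha):
    """Rank-based variant: the weights depend only on the ranks of z."""
    # Double argsort gives 0-based ranks (smallest -> 0); add 1 to match
    # the convention above (smallest -> rank 1, largest -> rank K).
    r = np.argsort(np.argsort(z)) + 1
    w = alpha ** r
    return w / w.sum()

z = np.array([0.1, 2.0, -1.3, 0.7])
print(softmax(z))            # ~[0.103, 0.686, 0.025, 0.187]
print(rank_softmax(z, 2.0))  # 2^[2,4,1,3] / 30 = ~[0.133, 0.533, 0.067, 0.267]
```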

Does this softmax variant (or a similar one based on the ranks of a vector's values instead of the values themselves) have a name, and is it used in practice?

josliber

2 Answers


I could not find this variant after quite a bit of Googling, Binging, and DuckDuckGoing... whoever uses it probably keeps it to themselves.

I found one paper, "Time series forecasting using a weighted cross-validation evolutionary artificial neural network ensemble" by Donate et al. (Neurocomputing, 2013), where, as an alternative to softmax weighting, they use rank-based weights in exactly the way $\sigma'$ does, setting $\alpha = e$ (2.7182...); but that's about it.

I suspect you already know the hierarchical softmax function as a variant of the softmax (or normalized exponential), but it is not related to ranks. A distant relation I can see is the SoftRank function of Taylor et al., which optimizes a smoothed version of the normalized discounted cumulative gain (see here for an application in a Gaussian process setting).

usεr11852

Yes, it is used in practice, generally in winner-take-all situations, or situations where you want to assign a value close to 1 to the maximum element and close to 0 to all the others. And it is still called softmax.

If you think about it, the entry of $\sigma'(\mathbf{z})$ corresponding to the largest element of $\mathbf{z}$ (rank $K$) will be very close to 1, and the values for all the others will be close to 0. See the paper "Softmax to Softassign: Neural Network Algorithms for Combinatorial Optimization" for more information:

https://www.researchgate.net/profile/Anand_Rangarajan/publication/2458489_Softmax_to_Softassign_Neural_Network_Algorithms_for_Combinatorial_Optimization/links/0deec5214b1cb9dbde000000.pdf#page=6
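
To make this concrete, here is a quick demo reusing the hypothetical `rank_softmax` sketch from the question above; note the winner-take-all behavior emerges as $\alpha$ grows:

```python
import numpy as np

def rank_softmax(z, alpha):
    """Rank-based softmax variant, as sketched in the question."""
    r = np.argsort(np.argsort(z)) + 1
    w = alpha ** r
    return w / w.sum()

z = np.array([0.1, 2.0, -1.3, 0.7])
for alpha in (2.0, 10.0, 100.0):
    print(alpha, rank_softmax(z, alpha))
# As alpha grows, the weight on the top-ranked element approaches 1:
# alpha=100 -> ~[1e-4, 0.990, 1e-6, 0.0099]
```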

  • Regarding your statement in the second paragraph, the highest-ranking element is only selected with high probability for large $\alpha$ -- note that all elements are selected with equal probability for $\alpha = 1$, and the smallest element is most likely for $\alpha < 1$ (see the numerical check after these comments). Unless I missed it, the linked paper appears to only use the standard softmax ($m_j = \frac{\exp(\beta X_j)}{\sum_{i=1}^I \exp(\beta X_i)}$ on page 6). Is there somewhere in that paper where you saw a softmax variant based on the ranking of elements? – josliber May 08 '16 at 00:43
  • On page 6, they are increasing the value of $\beta$ for each $m_i$. In your case, if you take $\alpha$ to be a probability (i.e., $\alpha < 1$), then yes, the lowest value will get a weight close to 1. Basically, what your function is doing is pushing the weight of the maximum (or minimum) toward 1 and the rest of the elements toward 0. Does this help? Or did you already know this and just want to see if there is a name for such a softmax variant? – The Wanderer May 08 '16 at 02:22
  • I certainly understand what $\sigma'(\cdot)$ is doing, but I'm wondering whether it has a name and is widely used. I still don't see how the example on page 6 is related to my question -- the same $\beta$ value is used to compute every $m_i$ in a given iteration (though $\beta$ changes between iterations of Procedure A on that page); further, the ranks of the elements aren't mentioned on page 6. – josliber May 08 '16 at 03:28
  • Hmm, I see. Then yes, I couldn't find anything either. – The Wanderer May 08 '16 at 03:32
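
A quick numerical check of the $\alpha$ regimes discussed in the comments above (again using the hypothetical `rank_softmax` sketch from the question):

```python
import numpy as np

def rank_softmax(z, alpha):
    """Rank-based softmax variant, as sketched in the question."""
    r = np.argsort(np.argsort(z)) + 1
    w = alpha ** r
    return w / w.sum()

z = np.array([0.1, 2.0, -1.3, 0.7])
print(rank_softmax(z, 1.0))  # alpha = 1: uniform, [0.25, 0.25, 0.25, 0.25]
print(rank_softmax(z, 0.5))  # alpha < 1: the smallest element (-1.3) gets the
                             # largest weight, ~[0.267, 0.067, 0.533, 0.133]
```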