15

I am doing an introduction to ML with TensorFlow and I came across the softmax activation function. Why is $e$ used in the softmax formula? Why not 2, 3, or 7?

$$ \text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} $$

$$ \begin{eqnarray} \sum_j a^L_j & = & \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1. \tag{79}\end{eqnarray} $$

Tensorflow tutorial

NN book
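
For concreteness, here is a minimal NumPy sketch (my own illustration, not from either source above) that evaluates the formula with the base left as a parameter; whichever base you pick, the outputs are positive and sum to 1:

```python
import numpy as np

def softmax(z, base=np.e):
    """Softmax with the base left as a parameter (base e is the usual choice)."""
    z = np.asarray(z, dtype=float)
    p = base ** (z - z.max())   # subtracting the max rescales numerator and denominator equally
    return p / p.sum()

z = np.array([1.0, 2.0, 3.0])
for b in (np.e, 2.0, 7.0):
    p = softmax(z, base=b)
    print(b, p.round(3), p.sum())   # every base gives outputs that sum to 1
```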

Gillian
  • 253

4 Answers

13

Using a different base is equivalent to scaling your data

Let $\mathbf{z} = \left(\ln a\right) \mathbf{y}$

Now observe that $e^{z_i} = a^{y_i}$ hence:

$$ \frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{a^{y_i}}{\sum_j a^{y_j}}$$

Multiplying vector $\mathbf{y}$ by the natural logarithm of $a$ is equivalent to switching the softmax function to base $a$ instead of base $e$.
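
A quick numerical check of this equivalence, sketched in NumPy (the function names are just for illustration):

```python
import numpy as np

def softmax(z):
    """Standard base-e softmax."""
    z = np.asarray(z, dtype=float)
    p = np.exp(z - z.max())
    return p / p.sum()

def softmax_base(y, a):
    """The same formula with base a instead of e."""
    y = np.asarray(y, dtype=float)
    p = a ** (y - y.max())
    return p / p.sum()

y = np.array([0.5, -1.0, 2.0])
a = 7.0
# base-a softmax of y equals base-e softmax of (ln a) * y
print(np.allclose(softmax_base(y, a), softmax(np.log(a) * y)))  # True
```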

You often have a linear model inside the softmax function (e.g. $z_i = \mathbf{x}' \mathbf{w}_i$). The weights $\mathbf{w}_i$ in $\mathbf{x}' \mathbf{w}_i$ can already scale the data, so allowing a different base wouldn't add any explanatory power. If the scaling can change, there's a sense in which all the different bases $a$ give equivalent models.

So why base $e$?

In exponential settings, $e$ is typically the most aesthetically beautiful, natural base to use: $\frac{d}{dx} e^x = e^x$. A lot of math can look prettier on the page when you use base $e$.
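
For comparison, a general base drags a constant factor into every derivative, which is part of why base $e$ keeps gradient formulas clean:

$$\frac{d}{dx}\, a^x = \frac{d}{dx}\, e^{x \ln a} = (\ln a)\, a^x, \qquad \frac{d}{dx}\, e^x = e^x.$$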

Matthew Gunn
  • 22,329
  • Would the function be the same if e to the z was replaced with ln(z) ? – Jack Vial Nov 09 '17 at 17:05
  • @Jack I'm not sure I follow what you specifically had in mind? – Matthew Gunn Nov 09 '17 at 17:24
  • If e^x = ln(x). Can the softmax function be written using ln(x) instead of e^x? – Jack Vial Nov 09 '17 at 18:01
  • You could write the softmax as $\operatorname{Softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^k e^{x_j}} = \exp\left(\ln\left( \frac{e^{x_i}}{\sum_{j=1}^k e^{x_j}} \right) \right) = \exp\left( x_i - \ln\left( \sum_{j=1}^k e^{x_j} \right)\right)$ (a sketch of this rewrite appears after these comments). But other than that, I really don't know what you're looking for? – Matthew Gunn Nov 09 '17 at 18:42
  • 2
    @MatthewGunn I think he wanted to know if rewriting the softmax equation to $\text{softmax}(x)_i = \frac{\ln(x_i)}{\sum_j \ln(x_j)}$ would yield the same result as $$\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$ However, the statement @Jack made, that "$e^x = \ln(x)$", isn't true, so I am not sure where he was going with that – Sebastian Nielsen Oct 22 '18 at 21:45
  • I have a question. The only reason we raise the independent variable as an exponent (x, y, z - whatever you want to call it) is so that we can avoid negative values canceling out positive values, right? If that is true (please tell me if you know that is the case), then we can replace $e^x$ with $|x|$, right? $\text{softmax}(x)_i = \frac{|x_i|}{\sum_j |x_j|}$ – Sebastian Nielsen Oct 22 '18 at 21:58
  • @SebastianNielsen No, that's different. Graph $y = \frac{e^x}{e^x + 1}$ and $y = \frac{|x|}{|x| + 1}$ and see they're rather different. The former is strictly increasing in $x$. The latter is decreasing in $x$ for $x < 0$. – Matthew Gunn Oct 22 '18 at 22:03
  • 3
    @SebastianNielsen My point was that $f(x) = \frac{a^x}{a^x + 1} $ and $f(x) = \frac{e^{bx}}{e^{bx} + 1}$ are literally the same function for $b = \ln a$. Basic math: $e^{x \ln a} = e^{\ln a^x} = a^x$. – Matthew Gunn Oct 22 '18 at 22:06
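
A small NumPy sketch of the log-sum-exp rewrite mentioned in the comments above; beyond being an identity, it is also the standard trick for computing softmax without overflow (the specific inputs below are just for illustration):

```python
import numpy as np

def softmax_via_logsumexp(x):
    """softmax(x)_i = exp(x_i - log(sum_j exp(x_j))), computed stably."""
    x = np.asarray(x, dtype=float)
    lse = x.max() + np.log(np.sum(np.exp(x - x.max())))   # log-sum-exp with the max factored out
    return np.exp(x - lse)

x = np.array([1000.0, 1001.0, 1002.0])    # naive np.exp(x) would overflow here
print(softmax_via_logsumexp(x))           # still valid probabilities summing to 1
```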
4

Some math becomes easier with $e$ as the base; that's why. Otherwise, consider this form of softmax: $\frac{e^{ax_i}}{\sum_j e^{ax_j}}$, which is equivalent to $\frac{b^{x_i}}{\sum_j b^{x_j}}$, where $b=e^a$.

Now, consider this function: $\sum_i\frac{e^{ax_i}}{\sum_j e^{ax_j}} x_i$. You can play with the coefficient $a$ to make the function a softer or harder version of the max.

When $a\to\infty$, it approaches $\max(x)$, because the weights $\frac{e^{ax_i}}{\sum_j e^{ax_j}}$ converge to a one-hot indicator of $\mathrm{argmax}(x)$.

When $a=1$ it is $\mathrm{softmax}(x)\cdot x$, a smoother version of the max.

When $a=0$ it is as soft as it gets: a simple average, $\frac{1}{n} \sum_i x_i$.
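
A short NumPy sketch of this interpolation, with `a` playing the role of the sharpness coefficient from this answer (the sample values are arbitrary):

```python
import numpy as np

def soft_maximum(x, a):
    """sum_i softmax(a * x)_i * x_i : interpolates between the mean and the max of x."""
    x = np.asarray(x, dtype=float)
    w = np.exp(a * (x - x.max()))
    w /= w.sum()
    return np.dot(w, x)

x = np.array([1.0, 2.0, 5.0])
for a in (0.0, 1.0, 10.0, 100.0):
    print(a, soft_maximum(x, a))
# a = 0    -> plain average, 8/3
# a large  -> max(x) = 5
```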

Aksakal
  • 61,310
3

This is indeed a somewhat arbitrary choice:

The choice of the softmax function seems somehow arbitrary as there are many other possible normalizing functions. It is thus unclear why the log-softmax loss would perform better than other loss alternatives.

Some potential reasons why this may be preferred over other normalizing functions:

  • it frames the inputs as log-likelihoods
  • it is easily differentiable (a sketch of the gradient check follows below)
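
On the second point, when the base is $e$ the derivative has a particularly clean closed form, $\partial\, \text{softmax}(x)_i / \partial x_j = s_i(\delta_{ij} - s_j)$ with $s = \text{softmax}(x)$. A hedged NumPy sketch that checks this against finite differences:

```python
import numpy as np

def softmax(x):
    p = np.exp(x - np.max(x))
    return p / p.sum()

def softmax_jacobian(x):
    """Closed form: d softmax_i / d x_j = s_i * (delta_ij - s_j)."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

# central finite differences as a sanity check on the closed form
x = np.array([0.2, -1.0, 3.0])
eps = 1e-6
numeric = np.column_stack([
    (softmax(x + eps * e) - softmax(x - eps * e)) / (2 * eps)
    for e in np.eye(len(x))
])
print(np.allclose(numeric, softmax_jacobian(x), atol=1e-6))  # True
```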
0

$\DeclareMathOperator*{\argmax}{\arg\!\max}$
In the context of classification, we want to predict the most likely class $c^*$ that a feature vector $\mathbf x$ belongs to as
$$c^* = \argmax_{i} P(C = c_i \mid \mathbf x)$$
where $c_1,\dots,c_N$ are the $N$ different classes. It is often convenient to re-write $P(C = c_i \mid \mathbf x)$ using Bayes' rule. Note that
\begin{align} P(C = c_i \mid \mathbf x) &= \frac{p(\mathbf x \mid C = c_i)P(C = c_i)}{p(\mathbf x)} \\ &= \frac{p(\mathbf x \mid C = c_i)P(C = c_i)}{\sum_{j=1}^N p(\mathbf x \mid C = c_j) P(C = c_j)} \end{align}

For any $a > 0$ with $a \neq 1$ and any $b > 0$,
$$a^{\log_a(b)} = b$$
So,
\begin{align} P(C = c_i \mid \mathbf x) &= \frac{p(\mathbf x \mid C = c_i)P(C = c_i)}{\sum_{j=1}^N p(\mathbf x \mid C = c_j) P(C = c_j)} \\ &= \frac{a^{\log_a\left(p(\mathbf x \mid C = c_i)P(C = c_i)\right)}}{\sum_{j=1}^N a^{\log_a\left(p(\mathbf x \mid C = c_j) P(C = c_j)\right)}} \end{align}

Letting $a = e$ such that
\begin{align} z_i &= \ln\left(p(\mathbf x \mid C = c_i)P(C = c_i)\right) \\ z_j &= \ln\left(p(\mathbf x \mid C = c_j)P(C = c_j)\right) \end{align}
and
$$\mathbf z = \begin{bmatrix} z_1 \\ \vdots \\ z_N\end{bmatrix}$$
we get
\begin{align} P(C = c_i \mid \mathbf x) &= \frac{\exp\left(\ln\left(p(\mathbf x \mid C = c_i)P(C = c_i)\right)\right)}{\sum_{j=1}^N \exp\left(\ln\left(p(\mathbf x \mid C = c_j) P(C = c_j)\right)\right)} \\ &= \frac{\exp\left(z_i\right)}{\sum_{j=1}^N \exp\left(z_j\right)} \\ &= \text{softmax}(\mathbf z)_i \end{align}

However, we could have chosen $a$ to be any other positive base, not just $e$. So there isn't really anything special about $e$ beyond its convenient properties.
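
A small NumPy sketch of this derivation with made-up priors and likelihoods, checking that the softmax of the log joint probabilities reproduces Bayes' rule, whether the base is $e$ or any other valid base $a$:

```python
import numpy as np

# Hypothetical numbers: class priors P(C = c_i) and likelihoods p(x | C = c_i)
# for a single feature vector x and N = 3 classes.
prior      = np.array([0.5, 0.3, 0.2])
likelihood = np.array([0.10, 0.40, 0.25])

# Posterior via Bayes' rule
joint = likelihood * prior
posterior_bayes = joint / joint.sum()

# Posterior via softmax of z_i = ln( p(x | c_i) P(c_i) )
z = np.log(joint)
posterior_softmax = np.exp(z - np.max(z))
posterior_softmax /= posterior_softmax.sum()

# Same result with another base a, using log base a of the joint probabilities
a = 7.0
z_a = np.log(joint) / np.log(a)
posterior_a = a ** (z_a - np.max(z_a))
posterior_a /= posterior_a.sum()

print(np.allclose(posterior_bayes, posterior_softmax))  # True
print(np.allclose(posterior_bayes, posterior_a))        # True
```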

mhdadk
  • 4,940