15

I am doing an introduction to ML with TensorFlow and I came across the softmax activation function. Why is $e$ used in the softmax formula? Why not 2, 3, or 7?

$$ \text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} $$

$$ \begin{eqnarray} \sum_j a^L_j & = & \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1. \tag{79}\end{eqnarray} $$

Tensorflow tutorial

NN book
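
For concreteness, here is a minimal NumPy sketch (my own illustration, not from either source above) that evaluates the formula with the base left as a parameter; whichever base you pick, the outputs are positive and sum to 1:

```python
import numpy as np

def softmax(z, base=np.e):
    """Softmax with the base left as a parameter (base e is the usual choice)."""
    z = np.asarray(z, dtype=float)
    p = base ** (z - z.max())   # subtracting the max rescales numerator and denominator equally
    return p / p.sum()

z = np.array([1.0, 2.0, 3.0])
for b in (np.e, 2.0, 7.0):
    p = softmax(z, base=b)
    print(b, p.round(3), p.sum())   # every base gives outputs that sum to 1
```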

Gillian
  • 253

4 Answers

13

Using a different base is equivalent to scaling your data

Let $\mathbf{z} = \left(\ln a\right) \mathbf{y}$

Now observe that $e^{z_i} = a^{y_i}$ hence:

$$ \frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{a^{y_i}}{\sum_j a^{y_j}}$$

Multiplying vector $\mathbf{y}$ by the natural logarithm of $a$ is equivalent to switching the softmax function to base $a$ instead of base $e$.
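
A quick numerical check of this equivalence, sketched in NumPy (the function names are just for illustration):

```python
import numpy as np

def softmax(z):
    """Standard base-e softmax."""
    z = np.asarray(z, dtype=float)
    p = np.exp(z - z.max())
    return p / p.sum()

def softmax_base(y, a):
    """The same formula with base a instead of e."""
    y = np.asarray(y, dtype=float)
    p = a ** (y - y.max())
    return p / p.sum()

y = np.array([0.5, -1.0, 2.0])
a = 7.0
# base-a softmax of y equals base-e softmax of (ln a) * y
print(np.allclose(softmax_base(y, a), softmax(np.log(a) * y)))  # True
```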

You often have a linear model inside the softmax function (e.g. $z_i = \mathbf{x}' \mathbf{w}_i$). The weights $\mathbf{w}_i$ in $\mathbf{x}' \mathbf{w}_i$ can already scale the data, so allowing a different base wouldn't add any explanatory power. If the scaling can change, there's a sense in which all the different bases $a$ give equivalent models.

So why base $e$?

In exponential settings, $e$ is typically the most aesthetically beautiful, natural base to use: $\frac{d}{dx} e^x = e^x$. A lot of math can look prettier on the page when you use base $e$.
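
For comparison, a general base drags a constant factor into every derivative, which is part of why base $e$ keeps gradient formulas clean:

$$\frac{d}{dx}\, a^x = \frac{d}{dx}\, e^{x \ln a} = (\ln a)\, a^x, \qquad \frac{d}{dx}\, e^x = e^x.$$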

Matthew Gunn
  • 22,329
  • Would the function be the same if e to the z was replaced with ln(z) ? – Jack Vial Nov 09 '17 at 17:05
  • @Jack I'm not sure I follow what you specifically had in mind? – Matthew Gunn Nov 09 '17 at 17:24
  • If e^x = ln(x). Can the softmax function be written using ln(x) instead of e^x? – Jack Vial Nov 09 '17 at 18:01
  • You could write the softmax as $\operatorname{Softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^k e^{x_j}} = \exp\left(\ln\left( \frac{e^{x_i}}{\sum_{j=1}^k e^{x_j}} \right) \right) = \exp\left( x_i - \ln\left( \sum_{j=1}^k e^{x_j} \right)\right)$ (a sketch of this rewrite appears after these comments). But other than that, I really don't know what you're looking for? – Matthew Gunn Nov 09 '17 at 18:42
  • 2
    @MatthewGunn I think he wanted to know if rewriting the softmax equation to $\text{softmax}(x)_i = \frac{\ln(x_i)}{\sum_j \ln(x_j)}$ would yield the same result as $$\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$ However, the statement @Jack made, that "$e^x = \ln(x)$", isn't true, so I am not sure where he was going with that – Sebastian Nielsen Oct 22 '18 at 21:45
  • I have a question. The only reason we raise the independent variable as an exponent (x, y, z - whatever you want to call it) is so that we can avoid negative values canceling out positive values, right? If that is true (please tell me if you know that is the case), then we can replace $e^x$ with $|x|$, right? $\text{softmax}(x)_i = \frac{|x_i|}{\sum_j |x_j|}$ – Sebastian Nielsen Oct 22 '18 at 21:58
  • @SebastianNielsen No, that's different. Graph $y = \frac{e^x}{e^x + 1}$ and $y = \frac{|x|}{|x| + 1}$ and see they're rather different. The former is strictly increasing in $x$. The latter is decreasing in $x$ for $x < 0$. – Matthew Gunn Oct 22 '18 at 22:03
  • 3
    @SebastianNielsen My point was that $f(x) = \frac{a^x}{a^x + 1} $ and $f(x) = \frac{e^{bx}}{e^{bx} + 1}$ are literally the same function for $b = \ln a$. Basic math: $e^{x \ln a} = e^{\ln a^x} = a^x$. – Matthew Gunn Oct 22 '18 at 22:06
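
A small NumPy sketch of the log-sum-exp rewrite mentioned in the comments above; beyond being an identity, it is also the standard trick for computing softmax without overflow (the specific inputs below are just for illustration):

```python
import numpy as np

def softmax_via_logsumexp(x):
    """softmax(x)_i = exp(x_i - log(sum_j exp(x_j))), computed stably."""
    x = np.asarray(x, dtype=float)
    lse = x.max() + np.log(np.sum(np.exp(x - x.max())))   # log-sum-exp with the max factored out
    return np.exp(x - lse)

x = np.array([1000.0, 1001.0, 1002.0])    # naive np.exp(x) would overflow here
print(softmax_via_logsumexp(x))           # still valid probabilities summing to 1
```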
4

Some math becomes easier with $e$ as the base; that's why. Otherwise, consider this form of softmax: $\frac{e^{ax_i}}{\sum_j e^{ax_j}}$, which is equivalent to $\frac{b^{x_i}}{\sum_j b^{x_j}}$, where $b=e^a$.

Now, consider this function: $\sum_i\frac{e^{ax_i}}{\sum_j e^{ax_j}} x_i$. You can play with the coefficient $a$ to make the function a softer or harder version of the max.

When $a\to\infty$, it approaches $\max(x)$, because the weights $\frac{e^{ax_i}}{\sum_j e^{ax_j}}$ converge to a one-hot indicator of $\mathrm{argmax}(x)$.

When $a=1$ it is $\mathrm{softmax}(x)\cdot x$, a smoother version of the max.

When $a=0$ it is as soft as it gets: a simple average, $\frac{1}{n} \sum_i x_i$.
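
A short NumPy sketch of this interpolation, with `a` playing the role of the sharpness coefficient from this answer (the sample values are arbitrary):

```python
import numpy as np

def soft_maximum(x, a):
    """sum_i softmax(a * x)_i * x_i : interpolates between the mean and the max of x."""
    x = np.asarray(x, dtype=float)
    w = np.exp(a * (x - x.max()))
    w /= w.sum()
    return np.dot(w, x)

x = np.array([1.0, 2.0, 5.0])
for a in (0.0, 1.0, 10.0, 100.0):
    print(a, soft_maximum(x, a))
# a = 0    -> plain average, 8/3
# a large  -> max(x) = 5
```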

Aksakal
  • 61,310
3

This is indeed a somewhat arbitrary choice:

The choice of the softmax function seems somehow arbitrary as there are many other possible normalizing functions. It is thus unclear why the log-softmax loss would perform better than other loss alternatives.

Some potential reasons why this may be preferred over other normalizing functions:

  • it frames the inputs as log-likelihoods
  • it is easily differentiable (a sketch of the gradient check follows below)
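
On the second point, when the base is $e$ the derivative has a particularly clean closed form, $\partial\, \text{softmax}(x)_i / \partial x_j = s_i(\delta_{ij} - s_j)$ with $s = \text{softmax}(x)$. A hedged NumPy sketch that checks this against finite differences:

```python
import numpy as np

def softmax(x):
    p = np.exp(x - np.max(x))
    return p / p.sum()

def softmax_jacobian(x):
    """Closed form: d softmax_i / d x_j = s_i * (delta_ij - s_j)."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

# central finite differences as a sanity check on the closed form
x = np.array([0.2, -1.0, 3.0])
eps = 1e-6
numeric = np.column_stack([
    (softmax(x + eps * e) - softmax(x - eps * e)) / (2 * eps)
    for e in np.eye(len(x))
])
print(np.allclose(numeric, softmax_jacobian(x), atol=1e-6))  # True
```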
0

$\DeclareMathOperator*{\argmax}{\arg\!\max}$
In the context of classification, we want to predict the most likely class $c^*$ that a feature vector $\mathbf x$ belongs to as
$$c^* = \argmax_{i} P(C = c_i \mid \mathbf x)$$
where $c_1,\dots,c_N$ are the $N$ different classes. It is often convenient to re-write $P(C = c_i \mid \mathbf x)$ using Bayes' rule. Note that
\begin{align} P(C = c_i \mid \mathbf x) &= \frac{p(\mathbf x \mid C = c_i)P(C = c_i)}{p(\mathbf x)} \\ &= \frac{p(\mathbf x \mid C = c_i)P(C = c_i)}{\sum_{j=1}^N p(\mathbf x \mid C = c_j) P(C = c_j)} \end{align}

For any $a > 0$ with $a \neq 1$ and any $b > 0$,
$$a^{\log_a(b)} = b$$
So,
\begin{align} P(C = c_i \mid \mathbf x) &= \frac{p(\mathbf x \mid C = c_i)P(C = c_i)}{\sum_{j=1}^N p(\mathbf x \mid C = c_j) P(C = c_j)} \\ &= \frac{a^{\log_a\left(p(\mathbf x \mid C = c_i)P(C = c_i)\right)}}{\sum_{j=1}^N a^{\log_a\left(p(\mathbf x \mid C = c_j) P(C = c_j)\right)}} \end{align}

Letting $a = e$ such that
\begin{align} z_i &= \ln\left(p(\mathbf x \mid C = c_i)P(C = c_i)\right) \\ z_j &= \ln\left(p(\mathbf x \mid C = c_j)P(C = c_j)\right) \end{align}
and
$$\mathbf z = \begin{bmatrix} z_1 \\ \vdots \\ z_N\end{bmatrix}$$
we get
\begin{align} P(C = c_i \mid \mathbf x) &= \frac{\exp\left(\ln\left(p(\mathbf x \mid C = c_i)P(C = c_i)\right)\right)}{\sum_{j=1}^N \exp\left(\ln\left(p(\mathbf x \mid C = c_j) P(C = c_j)\right)\right)} \\ &= \frac{\exp\left(z_i\right)}{\sum_{j=1}^N \exp\left(z_j\right)} \\ &= \text{softmax}(\mathbf z)_i \end{align}

However, we could have chosen $a$ to be any other positive base, not just $e$. So there isn't really anything special about $e$ beyond its convenient properties.
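
A small NumPy sketch of this derivation with made-up priors and likelihoods, checking that the softmax of the log joint probabilities reproduces Bayes' rule, whether the base is $e$ or any other valid base $a$:

```python
import numpy as np

# Hypothetical numbers: class priors P(C = c_i) and likelihoods p(x | C = c_i)
# for a single feature vector x and N = 3 classes.
prior      = np.array([0.5, 0.3, 0.2])
likelihood = np.array([0.10, 0.40, 0.25])

# Posterior via Bayes' rule
joint = likelihood * prior
posterior_bayes = joint / joint.sum()

# Posterior via softmax of z_i = ln( p(x | c_i) P(c_i) )
z = np.log(joint)
posterior_softmax = np.exp(z - np.max(z))
posterior_softmax /= posterior_softmax.sum()

# Same result with another base a, using log base a of the joint probabilities
a = 7.0
z_a = np.log(joint) / np.log(a)
posterior_a = a ** (z_a - np.max(z_a))
posterior_a /= posterior_a.sum()

print(np.allclose(posterior_bayes, posterior_softmax))  # True
print(np.allclose(posterior_bayes, posterior_a))        # True
```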

mhdadk
  • 4,940