Why do we use the natural exponential in logistic regression?

Question

I would like to intuitively understand the benefit of using the natural exponential in the sigmoid function used in logistic regression.

Why should it have to be $e^x$ instead of, for example $2^x$?

It is because the base of the log is e. If the base were 2 then you could use $2^x$. — Michael R. Chernick, Apr 05 '18 at 07:00
To continue the Chernick comment, $2^x = e^{\ln(2)x}$, so the choice of base (2, $e$, 7.94, etc.) is irrelevant. However, mathematically, the function $f(x) = e^x$ has the property that $f'(x) = f(x)$, that is, it is its own derivative, which makes a lot of computations easier. For example, if $g(x) = 2^x$, then $g'(x) = \ln(2)\,g(x)$, and so computations are messier. In calculating the derivative of the sigmoid, the fact that $e^x$ is its own derivative leads to a nice formula for the derivative of the sigmoid. — meh, Apr 05 '18 at 15:16
The short answer is mathematical simplicity and convenience, as others explain. But an example with opposite flavour is that using a parameterisation in terms of $2^{-t/\tau}$ rather than of $\exp(-t/\tau)$ in modelling exponential decline with time $t$ (or distance) has one advantage in that $\tau$ is immediately a half-life or halving distance (with similar comments for growth and doubling). That can be a little more direct in reporting, especially to less numerate groups. — Nick Cox, Apr 05 '18 at 15:25
Because a) any different logarithmic constant is just an ancillary parameter in an exponential family and b) a nice property of $\exp$ is that $\frac{\partial}{\partial x} \exp(x) = \exp(x)$. — AdamO, Apr 05 '18 at 15:36
For logistic regression, bases like $2$ or $10$ aren't special and have nothing to recommend them. The exponential function is special because $e^x$ is approximately equal to $1+x$ for small $|x|$. This leads to simple interpretations of coefficients and is not true of any other base. (That's also why, historically, natural logarithms were the first ones invented and tabulated.) Thus, you should be asking the inverse of this question in circumstances where you do not see $e$ as the base of the exponential. — whuber, Apr 05 '18 at 20:22

score 9 · Answer 1 · answered Apr 05 '18 at 14:27

Because base $e$ is convenient, and the choice of base doesn't matter if you can freely scale your coefficient estimates.

Would using a functional form of $\frac{a^\mathbf{x\cdot b}}{1 + a^\mathbf{x\cdot b} }$ change your explanatory power? No.

Explanation:

I gave basically the same answer here for the softmax function.

Observe that $ e^ { \mathbf{x} \cdot \mathbf{b} \left( \ln a \right) } = a^ {\mathbf{x} \cdot \mathbf{b}}$. Hence:

$$ \frac{a^\mathbf{x\cdot b}}{1 + a^\mathbf{x\cdot b} } = \frac{e^\mathbf{x\cdot \tilde{b}}}{1 + e^\mathbf{x\cdot \tilde{b}} } $$

where $\tilde{\mathbf{b}} = \left( \ln a \right) \mathbf{b} $. So using a base other than $e$ in the sigmoid function is the same as rescaling your $\mathbf{b}$ vector.
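The equivalence can be checked numerically. The following is a minimal sketch, not part of the original answer: the feature vector `x`, the coefficients `b`, and the base `a = 2` are hypothetical values chosen only for illustration, and the helper names `sigmoid_base` and `dot` are made up here.

```python
import math

def sigmoid_base(a, z):
    """Sigmoid built on base a: a**z / (1 + a**z)."""
    return a**z / (1.0 + a**z)

def dot(x, b):
    """Plain dot product of two equal-length sequences."""
    return sum(xi * bi for xi, bi in zip(x, b))

# Hypothetical inputs, chosen only for illustration.
x = [1.0, -0.5, 2.0]
b = [0.3, 1.2, -0.7]
a = 2.0

# Rescale the coefficients by ln(a) to move from base a to base e.
b_tilde = [math.log(a) * bi for bi in b]

p_base_a = sigmoid_base(a, dot(x, b))             # a^(x.b) / (1 + a^(x.b))
p_base_e = sigmoid_base(math.e, dot(x, b_tilde))  # e^(x.b~) / (1 + e^(x.b~))

# The two probabilities agree up to floating-point rounding.
print(p_base_a, p_base_e)
```

Since a fitted model is free to choose $\mathbf{b}$, a base-$a$ fit and a base-$e$ fit describe the same set of probability curves; only the scale on which the coefficients are reported changes.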

answered Apr 05 '18 at 14:27

Matthew Gunn

It's nice not to have to carry factors of $\ln a$ when doing IRLS or backpropagation. – Bridgeburners Apr 05 '18 at 14:43
@Bridgeburners Could you elaborate? – Matthew Gunn Apr 05 '18 at 14:53
I just mean that, when doing optimization that requires taking the derivative of a loss function that includes sigmoids, if those sigmoids used base $a$ instead of base $e$ in their exponential expressions, then we would have to track the factors of $\ln a$ that come from taking the derivative of those sigmoids. (Not that the two examples I listed always include sigmoids.) – Bridgeburners Apr 05 '18 at 15:00
@Bridgeburners Got it. You're giving another reason why $\frac{d}{dx} e^x = e^x$ is a convenient property. – Matthew Gunn Apr 05 '18 at 15:10

score 8 · Answer 2 · answered Apr 05 '18 at 15:00

In binary regression, one can use any cdf $\Phi$ to relate the probability $\mathbb{P}(Y=1|\mathbf{x})$ to $\mathbf{x}$ in a generalised linear way,

$$\mathbb{P}(Y=1|\mathbf{x})=\Phi(\mathbf{x}^\text{T}\beta)$$

as in

logistic cdf, $\Phi(t)=1/\{1+1/e^t\}$
probit (Normal) cdf, $\Phi(t)=\int_{-\infty}^t \varphi(x)\text{d}x$
log-log cdf, $\Phi(t)=\exp\{-\exp(-t)\}$

The logistic offers some advantages, such as making the conditional regression an exponential family model.
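To make the three choices concrete, here is a small sketch that evaluates each cdf on a few values of the linear predictor $t$. The function names are hypothetical, and the Normal cdf is written through the error function, using the standard identity $\Phi(t) = \tfrac12\{1+\operatorname{erf}(t/\sqrt2)\}$.

```python
import math

def logistic_cdf(t):
    """Logistic cdf: 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def probit_cdf(t):
    """Standard Normal cdf, written via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def loglog_cdf(t):
    """Log-log (Gumbel) cdf: exp(-exp(-t))."""
    return math.exp(-math.exp(-t))

# Each function maps a real-valued linear predictor to a probability in (0, 1).
for t in (-2.0, 0.0, 2.0):
    print(t, logistic_cdf(t), probit_cdf(t), loglog_cdf(t))
```

The logistic and probit curves are symmetric about $t = 0$, where both equal $1/2$; the log-log curve is not, taking the value $e^{-1}$ there.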

answered Apr 05 '18 at 15:00

Xi'an

This is a good answer to a more general question: "Why sigmoid?" although OP asked "Why $e^x$ in the sigmoid?". It is implied that one need not take $e^x$ but could take, say, $x^2/(1-x^2)$ for a more locally linear activation function. Nonetheless, the three models you present are important for understanding the modeling of binary outcomes in context. Along those lines, I would point out that the main advantage of the sigmoid is that the linear model has coefficients interpreted as log-odds ratios. The complementary log-log model estimates hazard ratios in a discrete-time survival model. – AdamO Apr 06 '18 at 16:07
I meant the antiderivative of $(1-x^2)/(1+x^2)$ which is a scaled arctangent function – AdamO Apr 06 '18 at 16:16

score 6 · Answer 3 · answered Apr 05 '18 at 15:40

For a Bernoulli likelihood, the variance is a function of the mean such that:

$$\text{var}(Y) = E(Y)(1-E(Y))$$

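It turns out that the sigmoid function $S(x) = \frac{\exp(x)}{1+\exp(x)}$, which is the inverse link function for logistic regression, has the property that:

$$\frac{\partial}{\partial x} S(x) = S(x)(1-S(x))$$

So with $E(Y) = S(x)$, the derivative of the mean with respect to the linear predictor equals the Bernoulli variance. The analogous property, that the derivative of the inverse link equals the variance function, holds for all GLMs using canonical parametrizations for exponential families.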
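The derivative identity, and the extra factor of $\ln a$ that a base-$a$ sigmoid would carry (as discussed in the comments on the first answer), can be checked with a central finite difference. This is only a sketch: the evaluation point `x = 0.8`, the base `a = 2`, and the helper `numeric_derivative` are assumptions made for illustration.

```python
import math

def sigmoid(x, a=math.e):
    """Sigmoid with base a: a**x / (1 + a**x); base e by default."""
    return a**x / (1.0 + a**x)

def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.8  # arbitrary evaluation point

# Base e: the derivative is S(x) * (1 - S(x)), with no extra constant.
s = sigmoid(x)
d_e = numeric_derivative(sigmoid, x)
closed_e = s * (1.0 - s)

# Base a: the chain rule brings in a factor of ln(a).
a = 2.0
s_a = sigmoid(x, a)
d_a = numeric_derivative(lambda t: sigmoid(t, a), x)
closed_a = math.log(a) * s_a * (1.0 - s_a)

print(d_e, closed_e)
print(d_a, closed_a)
```

In each printed pair the finite-difference value and the closed form should agree to several decimal places, which is the sense in which base $e$ saves bookkeeping in gradient-based fitting.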

edited Apr 05 '18 at 19:38

answered Apr 05 '18 at 15:40

AdamO

Adam can you explain why this is important? – seanv507 Apr 05 '18 at 20:02
@seanv507 when we find the MLE for the canonical parameter, it achieves the Cramer-Rao lower bound. That's why logistic regression gives tighter confidence bounds for a sample proportion (transformed from the log-odds scale, e.g. $S^{-1}(p)$) than the normal approximation. – AdamO Apr 05 '18 at 20:32
@AdamO: actually, I wonder if in the case of the Bernoulli this exponential family requirement excludes any cdf $\Phi$. – Xi'an Apr 06 '18 at 09:39
@Xi'an Hmm. Well the RV with DF $\Phi$ is normally distributed which follows an exp family, agreeing with the known optimal conditions of linear regression. Or are you speaking of probit regression? – AdamO Apr 06 '18 at 16:02
@Xi'an . Or another way I might see your comment: is the probit less efficient than logistic regression because it is not ML for a Bernoulli outcome? In that case, you're right, logistic will have better coverage rates of $1-\alpha$ CIs. – AdamO Apr 06 '18 at 16:10
@AdamO: I meant the regression/generalised linear model part. – Xi'an Apr 06 '18 at 16:27
