
Specifically, I mean

$$ f(x)= \begin{cases} -\log(1-x) & x \le 0 \\ \log(1+x) & x > 0 \end{cases} $$

(Plot: the proposed $f(x)$ in red, $\tanh(x)$ in blue.)

It behaves similarly to the widely used $\tanh(x)$ (blue), except that it avoids saturation/vanishing gradients since it has no horizontal asymptotes. It's also less computationally expensive.
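For concreteness, here is a minimal NumPy sketch of the piecewise definition above (the name `symlog` is just my label for it, not an established term), compared against $\tanh$ on the same inputs:

```python
import numpy as np

def symlog(x):
    """Proposed activation: sign(x) * log(1 + |x|), equivalent to the piecewise form above."""
    return np.sign(x) * np.log1p(np.abs(x))

x = np.linspace(-5.0, 5.0, 11)
print(np.round(symlog(x), 3))   # grows without bound, no horizontal asymptote
print(np.round(np.tanh(x), 3))  # saturates at -1 and +1
```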

Is there some issue with it I'm missing?


1 Answer


For a long time, neural network researchers believed that sigmoid activations like the inverse logit and $\tanh$ were the only activations that were necessary. This is because the Cybenko (1989) Universal Approximation Theorem (loosely) states that, under certain conditions, a neural network can approximate certain functions to a desired level of precision with 1 hidden layer & a finite number of units. One of the conditions is that the activation function is bounded. (For full details, consult the paper.)

The proposed function $f(x)=\operatorname{sign}(x)\log(1+|x|)$ is not bounded, so it does not satisfy the boundedness condition.

However, in the time since Cybenko published his UAT, many other UAT variants have been proven in different settings, allowing more flexibility in the choice of activation function, number of layers, and so on.

From the perspective of modern neural network theory, you would need to show that the proposed activation has some desirable property that is not found in alternative choices. One problem I anticipate with this activation is that its derivative, $f^\prime(x)=\frac{1}{1+|x|}$, goes to 0 as $|x|$ grows. This is undesirable because of the vanishing gradient phenomenon.
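A quick numerical check of that derivative (the helper name `symlog_grad` is mine, not from the post) makes the decay concrete:

```python
import numpy as np

def symlog_grad(x):
    """Derivative of sign(x) * log(1 + |x|): 1 / (1 + |x|)."""
    return 1.0 / (1.0 + np.abs(x))

for x in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    print(x, symlog_grad(x))  # shrinks toward 0 as |x| grows
```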

By contrast, an activation function whose derivative is exactly 1 over a "large" portion of its inputs is preferable because it ameliorates the vanishing gradient; ReLU and related functions are examples.
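As a toy illustration of why this matters, the sketch below multiplies one activation-derivative factor per layer through a 50-layer chain (weight matrices are ignored, and the pre-activation value of 3.0 at every layer is an arbitrary assumption on my part):

```python
import numpy as np

# Caricature of backprop through 50 layers: the gradient picks up one
# activation-derivative factor per layer (weights ignored for simplicity).
preacts = np.full(50, 3.0)  # assumed pre-activation of 3.0 at every layer

relu_factor = np.prod((preacts > 0).astype(float))  # 1.0: ReLU passes the gradient unchanged
symlog_factor = np.prod(1.0 / (1.0 + preacts))      # (1/4)**50, astronomically small

print(relu_factor, symlog_factor)
```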


Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems 2, 303–314 (1989). https://doi.org/10.1007/BF02551274

    "[...] derivative [...] is strictly less than 1 almost everywhere [...]" -- Isn't this true for all activation functions in common use except ReLU and its variants, though? Do modern nets only use these? – yuri kilochek Feb 16 '23 at 15:38
  • 17
    Indeed, this is the reason that ReLU-like activations are used in almost all modern neural networks. – Sycorax Feb 16 '23 at 15:53