Interestingly, the above version is sometimes called PReLU (parametric ReLU); see the Wikipedia page. The leaky version is the one with $a=0.01$, although both have the same functional form. The PReLU implementations in Keras and PyTorch also make the parameter $a$ learnable, which is why there are two names: it would not make much sense for the entire ML industry to fix $a$ at $0.01$. At inference time it makes no difference whether the layer is a PReLU or a leaky ReLU with an adjustable negative slope.
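To make the distinction concrete, here is a minimal PyTorch sketch (assuming `torch` is installed) contrasting `nn.LeakyReLU`, whose negative slope is a fixed hyperparameter, with `nn.PReLU`, whose slope is a learnable parameter:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

# Leaky ReLU: the negative slope is a fixed hyperparameter (default 0.01).
leaky = nn.LeakyReLU(negative_slope=0.01)

# PReLU: the negative slope `a` is a learnable parameter (PyTorch initializes
# it to 0.25) and is updated by backprop along with the other weights.
prelu = nn.PReLU(num_parameters=1, init=0.25)

print(leaky(x))                  # negative side scaled by a fixed 0.01
print(prelu(x))                  # negative side scaled by a learnable a (starts at 0.25)
print(list(prelu.parameters()))  # the single learnable parameter a
```

Once training is done, both compute the same kind of piecewise-linear function; the only difference is whether $a$ was chosen by hand or learned.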
That being said, I think Maas et al.'s 2013 paper might be the first publication in modern deep learning that mentions it (they use $a=0.01$). They don't cite another source for this function, but from their wording I take it that it had been defined or mentioned elsewhere before:
> ...To alleviate potential problems caused by the hard 0 activation of ReL units, we additionally evaluate leaky rectified linear (LReL) hidden units...
At least, this looks like the first modern deep learning reference to it.