17

When was the ReLU function first used in a neural network?

By ReLU, I mean the function $$ f(x) = \max\{0, x\}. $$
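
For concreteness, this is just elementwise thresholding at zero; a minimal NumPy sketch (mine, for illustration only, not taken from any of the papers discussed below):

    import numpy as np

    def relu(x):
        # Elementwise max(0, x): negative entries are clipped to zero.
        return np.maximum(0.0, x)

    relu(np.array([-2.0, 0.0, 1.5]))  # -> array([0. , 0. , 1.5])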

By neural network, I mean function-approximation machines composed of one or more "hidden layers."

(That is, I wish to exclude models that are viewed as "special cases" of neural networks, because if we admitted such special cases, the question would reduce to something like "when did anyone, in any context, first propose thresholding values below 0?", which is not really interesting to me.)

Sycorax

3 Answers

25

The earliest usage of the ReLU activation that I've found is Fukushima (1975, page 124, equation 2); thanks to johann for pointing this out. Fukushima also wrote at least one other paper involving ReLU activations (1980), but the 1975 paper is the earliest one that I am aware of. Unless I missed something, the function is not given any particular name in that paper. Because terminology is inconsistent and changes rapidly, it is entirely possible that I've missed a key detail in an even older publication.

It is common to cite Nair & Hinton (2010) as the first usage of $f$. For example, Schmidhuber (2014) cites Nair & Hinton when discussing ReLU units in his review article. Certainly, Nair & Hinton's paper is important because it spurred the recent interest in using $f$ in neural networks, and it is the source of the modern nomenclature "rectified linear units." Nonetheless, the idea of using $f$ as an activation is decades older than the 2010 paper.

Incidentally, Hinton also coauthored a chapter in Parallel Distributed Processing in which $f$ was used in a neural network. In that chapter, $f$ is called the "threshold function." However, the volume was published in 1986, eleven years after Fukushima's paper.


References

  • Jürgen Schmidhuber. "Deep Learning in Neural Networks: An Overview." 2014.

  • Kunihiko Fukushima. "Cognitron: A Self-Organizing Multilayered Neural Network." Biological Cybernetics, 20(3-4), 121–136. 1975. doi:10.1007/bf00342633

  • Kunihiko Fukushima. "Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position." Biological Cybernetics, 36, 193–202. 1980.

  • D. E. Rumelhart, G. E. Hinton, and J. L. McClelland. "A General Framework for Parallel Distributed Processing." In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. 1986.

  • Vinod Nair and Geoffrey E. Hinton. "Rectified Linear Units Improve Restricted Boltzmann Machines." ICML 2010.

Sycorax
  • Which poses the related question: who was the first to recognize (argue for) the superiority of the ReLU function over others? From the title of their paper, it sounds like Nair & Hinton (2010) could be the first. Which would be surprisingly late, given that Fukushima introduced them decades before! – Max Feb 12 '23 at 12:26
  • Fukushima definitely used the ReLU way earlier, but I think Nair & Hinton can be fairly credited with popularizing it. – Sycorax Feb 12 '23 at 13:05
  • @Max Apparently, Fukushima recognized the superiority of ReLU first; that's why he used it in many of his papers. However, he could not popularize it. – THN Dec 15 '23 at 03:27
10

Fukushima published the original Cognitron paper in 1975. That was the first instance of the ReLU; it is defined in Equation 2 of:

Fukushima, K. (1975). "Cognitron: A self-organizing multilayered neural network." Biological Cybernetics, 20(3-4), 121–136.

johann
9

Fukushima first used ReLU in a paper published in 1969, six years before the Cognitron paper, in what he called an analog threshold element (see Equation 2 and Figure 3):


K. Fukushima, "Visual Feature Extraction by a Multilayered Network of Analog Threshold Elements," in IEEE Transactions on Systems Science and Cybernetics, vol. 5, no. 4, pp. 322-333, Oct. 1969, doi: 10.1109/TSSC.1969.300225.

isarandi