ReLU, the argmax function (in hard attention) and max-pooling are non-differentiable functions, but we use back-propagation with ReLU and max-pooling without any problems. What makes "hard attention" different from them?
-
Because hard attention randomly samples attention from a given set of attention weights, you need to run multiple samples and then average the response before applying backpropagation. See here for example: https://jhui.github.io/2017/03/15/Soft-and-hard-attention/ – Alex R. Jan 10 '19 at 18:08
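The sketch below, which is not part of the original thread, illustrates the sampling idea with PyTorch; the score vector, values and toy downstream loss are illustrative assumptions.

```python
# Hard attention trained with a score-function (REINFORCE) estimator.
# Sketch assuming PyTorch; shapes and the toy "downstream loss" are illustrative only.
import torch

torch.manual_seed(0)

scores = torch.randn(5, requires_grad=True)      # unnormalised attention scores
values = torch.randn(5, 8)                       # values we attend over
target = torch.randn(8)                          # toy regression target

probs = torch.softmax(scores, dim=0)
dist = torch.distributions.Categorical(probs)

n_samples = 32
losses, log_probs = [], []
for _ in range(n_samples):
    idx = dist.sample()                          # hard (discrete) attention choice
    picked = values[idx]                         # non-differentiable selection
    loss = ((picked - target) ** 2).mean()       # downstream loss for this sample
    losses.append(loss)
    log_probs.append(dist.log_prob(idx))

losses = torch.stack(losses)
log_probs = torch.stack(log_probs)

# REINFORCE: grad E[loss] is estimated as mean((loss - baseline) * grad log p(idx))
baseline = losses.mean().detach()
surrogate = ((losses.detach() - baseline) * log_probs).mean()
surrogate.backward()

print(scores.grad)   # nonzero gradient despite the non-differentiable selection
```

Averaging this surrogate over many samples stands in for the gradient that the argmax-style selection itself cannot provide.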
2 Answers
The gradient of argmax is zero almost everywhere, and undefined where it is not zero. Gradients need to be nonzero if you want any weight updates to happen.
The gradient of max-pooling is nonzero almost everywhere: within each pooling window it is 1 with respect to the winning input and 0 with respect to the rest, so some gradient always flows through. The gradient of ReLU is also nonzero for all positive inputs. When all inputs to a ReLU unit are negative, backprop fails and the unit stops updating. This is known as a "dying ReLU", although it isn't a huge problem in general.
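A quick way to see the difference, as a sketch assuming PyTorch (the numbers are arbitrary and not from the original answer):

```python
# Compare gradients flowing through max-pooling, ReLU and argmax.
# Sketch assuming PyTorch; the input values are arbitrary.
import torch

x = torch.tensor([0.5, 2.0, -1.0, 3.0], requires_grad=True)

# max over the whole vector: gradient is 1 for the winning input, 0 elsewhere
x.max().backward()
print(x.grad)            # tensor([0., 0., 0., 1.])

x.grad.zero_()

# ReLU: gradient is 1 for positive inputs, 0 for negative ones
torch.relu(x).sum().backward()
print(x.grad)            # tensor([1., 1., 0., 1.])

# argmax: the output is an integer index, constant almost everywhere,
# so its gradient w.r.t. x is zero (PyTorch does not even track it)
print(torch.argmax(x).requires_grad)   # False
```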
-
I think the gradient of max-pooling is zero almost everywhere too; I read about the derivative of the min function, and the derivative of max looks like it. Or did you mean that in convolutional neural networks the derivative of max-pooling is zero almost everywhere within each local pooling region, but that this isn't a problem for the whole layer because the max-pooling operation is applied across the entire input? – floyd Jan 12 '19 at 17:24
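To illustrate the point in this comment, here is a sketch assuming PyTorch (the 1×1×4 input is an arbitrary example): inside each pooling window only the winning entry receives a gradient, but every window forwards one, so the layer as a whole still updates.

```python
# Gradient routing in max-pooling: zero for most entries inside each window,
# but every window passes a gradient to its maximum entry.
# Sketch assuming PyTorch; the input values are arbitrary.
import torch
import torch.nn.functional as F

x = torch.tensor([[[1.0, 4.0, 2.0, 3.0]]], requires_grad=True)  # shape (N=1, C=1, L=4)

out = F.max_pool1d(x, kernel_size=2)   # two windows: [1, 4] and [2, 3]
out.sum().backward()

print(x.grad)    # tensor([[[0., 1., 0., 1.]]]) -- one nonzero entry per window
```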
-
Hard attention is a good biologically inspired inductive bias for building an attentive mechanism into neural networks.
Unfortunately, it necessarily maps onto the argmax function, which returns the position of the input with the largest attention score.
The argmax function is not differentiable, as you point out in your question.
However, the max function used in ReLU and max-pooling is, in fact, piecewise differentiable:
$$\max(x,y) = \frac{x + y + |x - y|}{2} = \begin{cases} x, & x \ge y \\ y, & x < y \end{cases}$$
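A quick numerical check of this identity and its piecewise derivative, sketched with PyTorch (not part of the original answer):

```python
# Verify max(x, y) = (x + y + |x - y|) / 2 and its piecewise derivative.
# Sketch assuming PyTorch; the inputs are arbitrary.
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)

m = (x + y + (x - y).abs()) / 2
print(m.item())                       # 5.0, same as max(x, y)

m.backward()
print(x.grad.item(), y.grad.item())   # 0.0 1.0 -> derivative is 1 w.r.t. the larger input
```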