ReLU, the argmax function (in hard attention) and max-pooling are non-differentiable functions, but we use back-propagation with ReLU and max-pooling without any problems. What makes "hard attention" different from them?
-
Because hard attention randomly samples attention from a given set of attention weights, you need to run multiple samples and then average the response before applying backpropagation. See here for example: https://jhui.github.io/2017/03/15/Soft-and-hard-attention/ – Alex R. Jan 10 '19 at 18:08
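The sketch below, which is not part of the original thread, illustrates the sampling idea with PyTorch; the score vector, values and toy downstream loss are illustrative assumptions.

```python
# Hard attention trained with a score-function (REINFORCE) estimator.
# Sketch assuming PyTorch; shapes and the toy "downstream loss" are illustrative only.
import torch

torch.manual_seed(0)

scores = torch.randn(5, requires_grad=True)      # unnormalised attention scores
values = torch.randn(5, 8)                       # values we attend over
target = torch.randn(8)                          # toy regression target

probs = torch.softmax(scores, dim=0)
dist = torch.distributions.Categorical(probs)

n_samples = 32
losses, log_probs = [], []
for _ in range(n_samples):
    idx = dist.sample()                          # hard (discrete) attention choice
    picked = values[idx]                         # non-differentiable selection
    loss = ((picked - target) ** 2).mean()       # downstream loss for this sample
    losses.append(loss)
    log_probs.append(dist.log_prob(idx))

losses = torch.stack(losses)
log_probs = torch.stack(log_probs)

# REINFORCE: grad E[loss] is estimated as mean((loss - baseline) * grad log p(idx))
baseline = losses.mean().detach()
surrogate = ((losses.detach() - baseline) * log_probs).mean()
surrogate.backward()

print(scores.grad)   # nonzero gradient despite the non-differentiable selection
```

Averaging this surrogate over many samples stands in for the gradient that the argmax-style selection itself cannot provide.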
2 Answers
The gradient of argmax is zero almost everywhere, and undefined where it is not zero. Gradients need to be nonzero if you want any weight updates to happen.
The gradient of max-pooling is nonzero almost everywhere: within each pooling window it is 1 with respect to the winning input and 0 with respect to the rest, so some gradient always flows through. The gradient of ReLU is also nonzero for all positive inputs. When all inputs to a ReLU unit are negative, backprop fails and the unit stops updating. This is known as a "dying ReLU", although it isn't a huge problem in general.
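A quick way to see the difference, as a sketch assuming PyTorch (the numbers are arbitrary and not from the original answer):

```python
# Compare gradients flowing through max-pooling, ReLU and argmax.
# Sketch assuming PyTorch; the input values are arbitrary.
import torch

x = torch.tensor([0.5, 2.0, -1.0, 3.0], requires_grad=True)

# max over the whole vector: gradient is 1 for the winning input, 0 elsewhere
x.max().backward()
print(x.grad)            # tensor([0., 0., 0., 1.])

x.grad.zero_()

# ReLU: gradient is 1 for positive inputs, 0 for negative ones
torch.relu(x).sum().backward()
print(x.grad)            # tensor([1., 1., 0., 1.])

# argmax: the output is an integer index, constant almost everywhere,
# so its gradient w.r.t. x is zero (PyTorch does not even track it)
print(torch.argmax(x).requires_grad)   # False
```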
-
I think the gradient of max-pooling is zero almost everywhere too; I read about the derivative of the min function, and the derivative of max looks like it. Or did you mean that in convolutional neural networks the derivative of max-pooling is zero almost everywhere within each local pooling region, but that this isn't a problem for the whole layer because the max-pooling operation is applied across the entire input? – floyd Jan 12 '19 at 17:24
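To illustrate the point in this comment, here is a sketch assuming PyTorch (the 1×1×4 input is an arbitrary example): inside each pooling window only the winning entry receives a gradient, but every window forwards one, so the layer as a whole still updates.

```python
# Gradient routing in max-pooling: zero for most entries inside each window,
# but every window passes a gradient to its maximum entry.
# Sketch assuming PyTorch; the input values are arbitrary.
import torch
import torch.nn.functional as F

x = torch.tensor([[[1.0, 4.0, 2.0, 3.0]]], requires_grad=True)  # shape (N=1, C=1, L=4)

out = F.max_pool1d(x, kernel_size=2)   # two windows: [1, 4] and [2, 3]
out.sum().backward()

print(x.grad)    # tensor([[[0., 1., 0., 1.]]]) -- one nonzero entry per window
```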
-
Hard attention is a good biologically inspired inductive bias for building an attentive mechanism into neural networks.
Unfortunately, it necessarily maps onto the argmax function, which returns the position of the input with the largest attention score.
The argmax function is not differentiable, as you point out in your question.
However, the max function used in ReLU and max-pooling is, in fact, piecewise differentiable:
$$\max(x,y) = \frac{x + y + |x - y|}{2} = \begin{cases} x, & x \ge y \\ y, & x < y \end{cases}$$
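A quick numerical check of this identity and its piecewise derivative, sketched with PyTorch (not part of the original answer):

```python
# Verify max(x, y) = (x + y + |x - y|) / 2 and its piecewise derivative.
# Sketch assuming PyTorch; the inputs are arbitrary.
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)

m = (x + y + (x - y).abs()) / 2
print(m.item())                       # 5.0, same as max(x, y)

m.backward()
print(x.grad.item(), y.grad.item())   # 0.0 1.0 -> derivative is 1 w.r.t. the larger input
```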