
In GRU units, I don't understand the effective difference between the update gate $z_t$ and the reset gate $r_t$:

\begin{align} z_t &= \sigma_g(W_{z} x_t + U_{z} h_{t-1} + b_z) \\ r_t &= \sigma_g(W_{r} x_t + U_{r} h_{t-1} + b_r) \\ \hat{h}_t &= \phi_h(W_{h} x_t + U_{h} (r_t \odot h_{t-1}) + b_h) \\ h_t &= z_t \odot \hat{h}_t + (1-z_t) \odot h_{t-1} \end{align}

In the final rule, $z_t$ should select how much information to preserve from $\hat{h}_t$ versus the old state $h_{t-1}$, but the amount of $h_{t-1}$ to retain has already been selected inside $\hat{h}_t$ by $r_t$. So why is a new selection of $h_{t-1}$ made again with $1-z_t$?
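For concreteness, this is how I read these update rules in code (a minimal NumPy sketch of a single step; the parameter names and shapes are my own):

    import numpy as np

    def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
        # sigma_g is the logistic sigmoid, phi_h is tanh
        sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
        z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)            # update gate
        r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)            # reset gate
        h_hat = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev) + bh)  # candidate state
        return z_t * h_hat + (1.0 - z_t) * h_prev             # final blend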

1 Answer


You are correct: the reset gate $r_t$ (often described as a "forget" gate) doesn't fully control how much the unit forgets about the past $h_{t-1}$. Calling it a "forget gate" was meant to facilitate an intuition about its role, but as you noticed, the unit is more complicated than that.

The candidate hidden state $\hat h_t$ is a non-linear function of the current input $x_t$ and the past state $h_{t-1}$. We want it to be like this because the point of using recurrent neural networks is to model non-linear changes over time. If $\hat h_t$ were only a non-linear function of the current input, and the past entered only through the linear blend $z_t \hat h_t + (1- z_t) h_{t-1}$, this would just be a kind of exponential smoothing with changing weights. But we want to model more complicated, non-linear changes over time.

Also, keep in mind that the GRU unit is able to learn that simpler version of the model by pushing $r_t$ toward zero, so the simpler version is still available under the GRU.
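To make the difference concrete, here is a tiny scalar sketch (the parameter values and the five random inputs are made up, purely for illustration): one state follows the full GRU update, the other uses a candidate that ignores the past state, as if $r_t$ were zero, which reduces the recurrence to exponential smoothing with a data-dependent weight.

    import numpy as np

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    rng = np.random.default_rng(0)

    # toy scalar parameters, chosen only for illustration
    Wz, Uz, bz = 0.5, 0.3, 0.0
    Wr, Ur, br = 0.4, -0.6, 0.0
    Wh, Uh, bh = 1.0, 0.8, 0.0

    h_full = h_simple = 0.0
    for t, x in enumerate(rng.normal(size=5)):
        # full GRU: candidate is non-linear in both x_t and the past state
        z = sigmoid(Wz * x + Uz * h_full + bz)
        r = sigmoid(Wr * x + Ur * h_full + br)
        h_hat = np.tanh(Wh * x + Uh * (r * h_full) + bh)
        h_full = z * h_hat + (1 - z) * h_full

        # "simpler version": candidate ignores the past (as if r_t = 0),
        # so the recurrence is just exponential smoothing with a changing weight
        zs = sigmoid(Wz * x + Uz * h_simple + bz)
        h_simple = zs * np.tanh(Wh * x + bh) + (1 - zs) * h_simple

        print(t, round(h_full, 3), round(h_simple, 3))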

Another reason we have more complicated units like GRU or LSTM is the problem of vanishing and exploding gradients. While simpler RNNs should work in principle, estimating their parameters runs into these computational problems, and GRU and LSTM were designed to overcome them.

Tim
  • Ok, but if we want to model more complicated, non-linear changes over time, is it not enough to model $\hat{h}_t$ as a function of $x_t$ and $h_{t-1}$ (as already done) and simply set $h_t = z_t \odot \hat{h}_t$? – volperossa Aug 25 '22 at 10:17
  • @volperossa "enough" in what sense? For some problems an RNN would be "enough", for some a GRU, for some an LSTM, for some classical time-series models like ARIMA or exponential smoothing, and for some the "enough" model would be to always predict a constant like the historical average. We have many models because no single one works for all cases. Sure, nobody prohibits you from trying simpler ones before moving to something more complicated. – Tim Aug 25 '22 at 10:21
  • Also keep in mind that the unit was designed to prevent problems like vanishing and exploding gradients, not only to model the data. – Tim Aug 25 '22 at 10:23
  • Let me try to explain myself better (split across two comments because of the length limit): in the general GRU form I see a redundancy of $h_{t-1}$, which appears twice: the first time in the expression for $\hat{h}_t$, the second time in $h_t$ (which is a function of both $\hat{h}_t$ and, again, $h_{t-1}$). I'm looking for the rationale behind this. I understand your point that it can help capture non-linear changes. – volperossa Aug 25 '22 at 10:40
  • However, this is already done in the definition of $\hat{h}_t$, which includes both the non-linearity and $h_{t-1}$, and I still don't understand why $h_{t-1}$ appears again later, in the linear combination defining $h_t$. If it were only about non-linearity, it should be enough to define $h_t = z_t \odot \hat{h}_t$ or, more simply, $h_t = \hat{h}_t$ (a small numeric sketch of this comparison appears after these comments). I just want to understand the reasoning behind the GRU proposal, which seems very complex to me (knowing, however, that there are simpler models). Is it only for avoiding vanishing/exploding gradients? – volperossa Aug 25 '22 at 10:40
  • 1
    @volperossa every neural network has a lot of redundancies. Classification of the neural network without redundancies is logistic regression, every more complicated NN has redundant nodes. They are complicated because we want to model complicated, non-linear functions & for computational reasons like vanishing/exploding gradients. – Tim Aug 25 '22 at 10:45
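Following up on the comment thread, here is another small scalar sketch (made-up parameters, chosen so that $z_t$ stays small) comparing the GRU rule with the proposed alternative $h_t = z_t \odot \hat{h}_t$:

    import numpy as np

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    # toy scalar parameters, picked so that z_t stays small (illustration only)
    Wz, Uz, bz = 0.1, 0.1, -3.0
    Wr, Ur, br = 0.5, 0.5, 0.0
    Wh, Uh, bh = 1.0, 1.0, 0.0

    h_gru = h_alt = 5.0  # both start from the same old state
    for x in [0.1, -0.2, 0.05, 0.0, 0.1]:
        # GRU rule: a small z_t mostly copies h_{t-1} forward
        z = sigmoid(Wz * x + Uz * h_gru + bz)
        r = sigmoid(Wr * x + Ur * h_gru + br)
        h_hat = np.tanh(Wh * x + Uh * (r * h_gru) + bh)
        h_gru = z * h_hat + (1 - z) * h_gru

        # proposed alternative h_t = z_t * h_hat_t: a small z_t shrinks the state
        za = sigmoid(Wz * x + Uz * h_alt + bz)
        ra = sigmoid(Wr * x + Ur * h_alt + br)
        h_alt = za * np.tanh(Wh * x + Uh * (ra * h_alt) + bh)

        print(round(h_gru, 3), round(h_alt, 3))

With the convex combination, a small $z_t$ mostly copies $h_{t-1}$ forward unchanged, which also gives gradients a direct additive path through $(1-z_t) \odot h_{t-1}$; with $h_t = z_t \odot \hat{h}_t$, the same small $z_t$ instead shrinks the state toward zero.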