What is the relationship between the size of the hidden layer and the size of the cell state layer in an LSTM?

Question

I was following some examples to get familiar with TensorFlow's LSTM API, but noticed that all LSTM initialization functions require only the num_units parameter, which denotes the number of hidden units in a cell.

According to what I learned from colah's well-known blog post, the cell state has nothing to do with the hidden layer, so (I think) the two could have different dimensions, and we would then need to pass at least 2 parameters, one for the hidden size and one for the cell state size.

So this confuses me a lot when trying to figure out what TensorFlow's cells do. Under the hood, are they implemented like this just for convenience, or did I misunderstand something in the blog post?

score 4 · Answer 1 · answered Sep 30 '19 at 04:24

I had a very similar confusion about the dimensions. Here's the rundown:

Every node you see inside the LSTM cell has exactly the same output dimension, including the cell state. Otherwise, how could you possibly do the element-wise multiplications between the cell state and the forget gate or the output gate? They have to have the same dimensions for that to work.

Using an example where n_hiddenunits = 256:

Output of forget gate: 256
Input gate: 256
Activation gate: 256
Output gate: 256
Cell state: 256
Hidden state: 256

Now this is obviously a problem if you want the LSTM to output, say, a one-hot vector of size 5. To handle that, a softmax layer is added on top of the hidden state to convert it to the right dimension. It is just a standard feed-forward layer with ordinary weights (I leave out the biases here) followed by a softmax. Now, also imagine that the input is a one-hot vector of size 5:

input size: 5
total input size to all gates: 256+5 = 261 (the hidden state and the input are concatenated)
Output of forget gate: 256
Input gate: 256
Activation gate: 256
Output gate: 256
Cell state: 256
Hidden state: 256
Final output size: 5

Those are the final dimensions of the cell.
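The dimensions listed above can be checked with a minimal NumPy sketch of a single LSTM step. This is only an illustration under stated assumptions: `lstm_step`, `W`, `b` and `W_out` are hypothetical names, the weights are random, and the order in which the four gates are stacked inside `W` is an arbitrary choice here, not necessarily TensorFlow's internal layout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * n_hidden, n_hidden + n_input): the four gates are
    stacked row-wise (stacking order is an assumption of this sketch).
    """
    n_hidden = h_prev.shape[0]
    concat = np.concatenate([h_prev, x_t])       # (n_hidden + n_input,) = (261,)
    z = W @ concat + b                           # (4 * n_hidden,)
    f = sigmoid(z[0 * n_hidden:1 * n_hidden])    # forget gate, (256,)
    i = sigmoid(z[1 * n_hidden:2 * n_hidden])    # input gate, (256,)
    g = np.tanh(z[2 * n_hidden:3 * n_hidden])    # activation (candidate) gate, (256,)
    o = sigmoid(z[3 * n_hidden:4 * n_hidden])    # output gate, (256,)
    c_t = f * c_prev + i * g                     # element-wise, so all must be (256,)
    h_t = o * np.tanh(c_t)                       # element-wise, so h_t is (256,) too
    return h_t, c_t

n_input, n_hidden, n_output = 5, 256, 5
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_input))
b = np.zeros(4 * n_hidden)
W_out = rng.normal(scale=0.1, size=(n_output, n_hidden))  # output layer, no bias

x_t = np.eye(n_input)[2]                         # one-hot input of size 5
h_prev = np.zeros(n_hidden)
c_prev = np.zeros(n_hidden)

h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, b)

logits = W_out @ h_t                             # (5,)
y_t = np.exp(logits - logits.max())
y_t /= y_t.sum()                                 # softmax over 5 classes

print(h_t.shape, c_t.shape, y_t.shape)           # (256,) (256,) (5,)
print(W.size + b.size)                           # 268288 parameters in the cell
```

The parameter count is 4 * (261 * 256 + 256) = 268288, which is why a single size parameter is enough to fix every shape inside the cell once the input size is known.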

Recessive

All the gates have to have the same dimension output as the cell state in order to perform element-wise multiplications. However, this does not answer why the hidden state and the cell state vectors have to have the same dimensions. You could very well have different hidden and cell state dimensions and still force the concatenation of the hidden state with the input vector to have the same dimension as the cell state after going through the gates. – Mike Apr 10 '20 at 22:25
@Mike that's exactly what I'm thinking, and I don't get why in TF the units parameter represents both the cell size and the output size – Alberto Jul 27 '22 at 23:46
@AlbertoSinigaglia I can't answer Mike's question right now (I assume it's just convention to keep the sizes the same, but I'm not 100% sure), but the units parameter represents both because the output is just the hidden state. The idea is that if you don't want the hidden state to be this size, you can add a linear layer on top of it and transform it however you see fit, which is what defining the output size would do anyway – Recessive Jul 28 '22 at 02:01
@Recessive I'm saying that all the dimensions can be different. If you check the equations in the PyTorch docs, you see that the only requirement is that the hidden cell has to have the same size as the gates; the rest can vary – Alberto Jul 28 '22 at 09:25

score 1 · Answer 2 · answered May 20 '21 at 17:36

My understanding of an LSTM layer composed of 4 cells is depicted in the following picture:

This would explain why the hidden state of the whole layer has exactly as many dimensions as there are hidden states (or cells).

However, what I still don't fully understand is the effect of return_sequences between LSTM layers, which changes the output shape from [hidden_states] to [x_dimension, hidden_states], where x_dimension is the number of time steps. The usual explanation is that we normally only care about the hidden state at the last time step, whereas when stacking multiple layers, the hidden states at all time steps are passed into the next layer. Nevertheless, I still cannot make sense of it graphically.

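e.g.

    from tensorflow import keras

    model = keras.models.Sequential([
        # Each LSTM layer emits its hidden state at every time step
        keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
        keras.layers.LSTM(20, return_sequences=True),
        # The Dense layer is applied independently at each time step
        keras.layers.TimeDistributed(keras.layers.Dense(10)),
    ])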
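To make the shape change concrete, here is a small NumPy sketch of what returning sequences amounts to. It is not Keras code: `run_layer` is a hypothetical helper with random weights and no bias, and it leaves out the batch dimension that Keras adds in front, so Keras would report (batch, timesteps, units) instead of (timesteps, units).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_layer(xs, n_units, return_sequences, seed=0):
    """Toy LSTM layer over one sequence; xs has shape (timesteps, n_features)."""
    n_features = xs.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(4 * n_units, n_units + n_features))
    h = np.zeros(n_units)
    c = np.zeros(n_units)
    hidden_states = []
    for x_t in xs:                               # the same cell is reused at every step
        z = W @ np.concatenate([h, x_t])
        f = sigmoid(z[0 * n_units:1 * n_units])
        i = sigmoid(z[1 * n_units:2 * n_units])
        g = np.tanh(z[2 * n_units:3 * n_units])
        o = sigmoid(z[3 * n_units:4 * n_units])
        c = f * c + i * g
        h = o * np.tanh(c)
        hidden_states.append(h)
    # Either every hidden state, or only the one from the last time step
    return np.stack(hidden_states) if return_sequences else h

xs = np.ones((7, 1))                             # 7 time steps, 1 feature

seq = run_layer(xs, 20, return_sequences=True)
last = run_layer(xs, 20, return_sequences=False)
print(seq.shape, last.shape)                     # (7, 20) (20,)

# A stacked second layer consumes the full sequence of hidden states
stacked = run_layer(seq, 20, return_sequences=True, seed=1)
print(stacked.shape)                             # (7, 20)
```

With the same weights, `last` is simply the final row of `seq`, so the extra leading dimension is the time axis, not an extra set of cells.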

score 1 · Answer 3 · answered May 24 '21 at 10:13

Look at the equation for computing the hidden state as a function of the cell state and output gate: $$ h_t = \tanh(C_t)\circ o_t $$ This equation implies that the hidden state and cell state have the same dimensionality.
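For context, the remaining equations of the standard LSTM can be written out with their dimensions. This is a sketch in the notation of colah's blog post, under the assumption that $n$ is the number of units, $m$ is the input size, $[h_{t-1}, x_t] \in \mathbb{R}^{n+m}$ is the concatenation of the previous hidden state and the input, and every weight matrix has shape $n \times (n+m)$:

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f), & f_t &\in \mathbb{R}^{n} \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i), & i_t &\in \mathbb{R}^{n} \\
\tilde{C}_t &= \tanh(W_C\,[h_{t-1}, x_t] + b_C), & \tilde{C}_t &\in \mathbb{R}^{n} \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o), & o_t &\in \mathbb{R}^{n} \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tilde{C}_t, & C_t &\in \mathbb{R}^{n} \\
h_t &= \tanh(C_t) \circ o_t, & h_t &\in \mathbb{R}^{n}
\end{aligned}
$$

The element-wise products in the update of $C_t$ force the gates and the cell state to share the size $n$, and the last line then gives $h_t$ that same size in this formulation.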

Stand with Gaza
