Case 1:
I am feeding variable-length time-series windows to a GRU model: sometimes a window contains 900 samples, sometimes only 16. I chose an RNN (GRU) because I learned that recurrent methods work well on long sequences. I use a single GRU layer and return the hidden states for all time steps, so that the representation captures information from every time stamp. I then apply average pooling over the GRU outputs to obtain a fixed-length representation. The intuition for using average pooling instead of max pooling is that it summarizes information across all time steps. Here is the code of the model:
input_layer = tf.keras.Input(shape=input_shape, name="time_series_activity")
# Zero-padded timesteps are skipped by downstream layers via the mask
input_mask = tf.keras.layers.Masking(mask_value=0.0)(input_layer)
# Single GRU layer returning the hidden state at every time step
gru_l5 = tf.keras.layers.GRU(64, activation='tanh', recurrent_activation='sigmoid',
                             recurrent_initializer=tf.keras.initializers.Orthogonal(),
                             dropout=0.5, recurrent_dropout=0.5,
                             return_sequences=True)(input_mask)
# Average over time to obtain a fixed-length representation of the whole window
AP = tf.keras.layers.GlobalAveragePooling1D()(gru_l5)
gru_fm = tf.keras.layers.Dropout(0.3)(AP)
output_layer = tf.keras.layers.Dense(total_classes, activation="softmax")(gru_fm)
return tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
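For context, the variable-length windows are zero-padded to a common length before being fed to the Masking layer. Below is a minimal sketch of that step; windows, max_window_len and the post-padding/float32 choices are placeholders of mine, not shown in the original code:

import tensorflow as tf

# windows: list of arrays of shape (timesteps, n_features), with timesteps between 16 and 900
X_train_padded = tf.keras.preprocessing.sequence.pad_sequences(
    windows, maxlen=max_window_len, dtype="float32",
    padding="post", value=0.0)  # the padded zeros are later skipped by Masking(mask_value=0.0)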
With this model I get better performance on the validation set, while on the training data the performance climbs to 100% (which looks like it is going wrong); the major issue, however, is that the validation loss is "nan." This issue has been discussed on GitHub and StackOverflow.
I have tried nearly all of the options provided here, here and here, but I am still unable to resolve this validation_loss = nan issue.
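For example, one class of checks from those threads is to verify that the padded inputs themselves contain no nan/inf values and to abort training as soon as the loss diverges; a rough sketch (X_train_padded is the placeholder from above):

import numpy as np
import tensorflow as tf

# sanity-check the zero-padded training inputs
assert not np.isnan(X_train_padded).any(), "nan values in the training data"
assert not np.isinf(X_train_padded).any(), "inf values in the training data"
nan_guard = tf.keras.callbacks.TerminateOnNaN()  # pass via model.fit(callbacks=[nan_guard]) to stop on the first nan loss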
Case 2:
Then I decided not to take all of the GRU's hidden states but to keep only the last hidden state, which already gives a fixed-length representation and removes the need for pooling. With this model the validation loss = "nan" problem is gone, but the performance on the test data drops drastically. Here is this model's source code:
input_layer = tf.keras.Input(shape=input_shape, name="time_series_activity")
input_mask = tf.keras.layers.Masking(mask_value=0.0)(input_layer)
# return_sequences is False by default: only the last hidden state is returned,
# which is already a fixed-length representation
gru_l5 = tf.keras.layers.GRU(64, activation='tanh', recurrent_activation='sigmoid',
                             recurrent_initializer=tf.keras.initializers.Orthogonal(),
                             dropout=0.5, recurrent_dropout=0.5)(input_mask)
gru_fm = tf.keras.layers.Dropout(0.3)(gru_l5)
output_layer = tf.keras.layers.Dense(total_classes, activation="softmax")(gru_fm)
return tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
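For completeness, both models are compiled and trained roughly as follows; the optimizer, loss, label format, batch size and the build_model name are assumptions of mine, not part of the code above:

import tensorflow as tf

model = build_model()  # placeholder name for the function whose body is shown above
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",  # assuming one-hot encoded labels to match the softmax output
              metrics=["accuracy"])
model.fit(X_train_padded, y_train,
          validation_data=(X_val_padded, y_val),
          batch_size=32, epochs=50)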
Comparing the results of the two cases, my feeling is that in Case 1 the vanishing gradient problem shows up on the longer sequences (a quick way to check this is sketched below). Any thoughts or discussion on resolving this "nan" issue while achieving high performance would be much appreciated.
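To check the vanishing-gradient suspicion, this is the kind of per-layer gradient-norm inspection I have in mind (x_batch, y_batch and the loss function are placeholders):

import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy()  # assuming one-hot labels, as above
with tf.GradientTape() as tape:
    preds = model(x_batch, training=True)  # one padded batch of long windows
    loss = loss_fn(y_batch, preds)
grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    print(var.name, float(tf.norm(grad)))  # near-zero norms would point to vanishing gradients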