I am attempting to train an LSTM NN on a time-series dataset that contains daily observations for over a thousand distinct devices, in order to predict when the devices will fail. I have handled my data cleaning and setup appropriately, and can successfully train a NN on the data.
Now I would like to run through many training iterations of the NN, each with different hyperparameters, to find the best set of hyperparameters for the data. In doing so, I am encountering a very strange error: I can train exactly 13 NNs, and then my program gets killed. When running in the terminal, I simply get a message that says `Killed`; running in the PyCharm console gives me a "process completed with exit code 1" message. I don't think the issue is running out of memory, as I keep the heap size monitor up when running in the console and never come near the maximum. It also seems like more than coincidence that it fails on the 13th run every time, given that I pick hyperparameters randomly (so I am not running the same 13 NNs each time I run the program).
Does anyone have advice on where I might start troubleshooting this?
Thank you.
Function I loop over to train subsequent NNs:
```python
# Imports assumed from the rest of my script (not shown in the original snippet);
# adjust keras vs. tensorflow.keras to match your environment.
import random

import pandas as pd
from sklearn.metrics import roc_auc_score
from tensorflow.keras import metrics, optimizers, regularizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import LSTM, BatchNormalization, Dense, Dropout
from tensorflow.keras.models import Sequential


# The def line was omitted from my original snippet; this signature is inferred
# from the variables used in the body.
def train_one_nn(X_train_input, y_train_input2, X_val_input, y_val_input2,
                 hidden_layers, activation_func, ind):
    # Randomly sample hyperparameters for this run
    lrate = round(10 ** (-4 * random.uniform(0, 1)), 3)  # 1e-4 to 1
    decay_rate = round(10 ** (-4 * random.uniform(0, 1)), 3)
    epochs_ = random.randint(25, 250)
    batches = random.randint(8, 64)
    l2_rate = round(10 ** (-4 * random.uniform(0, 1)), 3)
    dropout_rate = round(random.uniform(0.1, 0.5), 3)

    record_params = {'hidden_layers': hidden_layers, 'lrate': lrate, 'decay_rate': decay_rate,
                     'epochs': epochs_, 'batches': batches, 'l2_rate': l2_rate,
                     'dropout_rate': dropout_rate}

    # Build the model
    model = Sequential()
    model_input_shape = X_train_input.shape[1:]
    model.add(LSTM(hidden_layers, input_shape=model_input_shape, return_sequences=True,
                   kernel_regularizer=regularizers.l2(l2_rate),
                   activation=activation_func))
    model.add(Dropout(dropout_rate))
    model.add(BatchNormalization())
    model.add(LSTM(hidden_layers, kernel_regularizer=regularizers.l2(l2_rate),
                   activation=activation_func))
    model.add(Dropout(dropout_rate))
    model.add(BatchNormalization())
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer=optimizers.Adam(lr=lrate, decay=decay_rate),
                  metrics=[metrics.AUC(), metrics.FalseNegatives(), metrics.FalsePositives()])

    # Train with early stopping on validation loss
    model.fit(X_train_input, y_train_input2, batch_size=batches, epochs=epochs_,
              validation_data=(X_val_input, y_val_input2), verbose=1,
              shuffle=False, callbacks=[EarlyStopping(monitor='val_loss', patience=15)])

    # Predict on train and validation sets
    yhat = model.predict(X_train_input, batch_size=batches)
    yhat_val = model.predict(X_val_input, batch_size=batches)

    # Record ROC AUC for this hyperparameter set
    record_params['train_roc'] = round(roc_auc_score(y_train_input2, yhat.reshape(yhat.shape[0])), 3)
    record_params['val_roc'] = round(roc_auc_score(y_val_input2, yhat_val.reshape(yhat_val.shape[0]),
                                                   average='weighted'), 3)
    return pd.DataFrame(record_params, index=[ind])
```
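For context, the outer loop that calls this function looks roughly like the sketch below (a simplified sketch; `train_one_nn`, `num_trials`, and `results_df` are illustrative names, not my exact code):

```python
# Simplified sketch of the hyperparameter search loop (illustrative names)
results_df = pd.DataFrame()
num_trials = 50  # I never get past the 13th trial before the process is killed

for ind in range(num_trials):
    row = train_one_nn(X_train_input, y_train_input2, X_val_input, y_val_input2,
                       hidden_layers, activation_func, ind)
    results_df = pd.concat([results_df, row])
    print(results_df.tail(1))
```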
[Memory usage does not appear to be the issue; this screenshot was taken during the 5th NN run.][1]
[1]: https://i.stack.imgur.com/Rb9GL.png
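In case it helps, this is roughly how I could log the Python process's memory from inside the loop to double-check the heap-size readings (a minimal sketch; it assumes the `psutil` package, which my current script does not use):

```python
import os
import psutil  # assumed to be installed; not part of my current script

_process = psutil.Process(os.getpid())

def log_memory(iteration):
    # Resident set size (RSS) of this Python process, in MB
    rss_mb = _process.memory_info().rss / (1024 ** 2)
    print(f"Iteration {iteration}: RSS = {rss_mb:.1f} MB")
```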