
I am working on a multiclass classification problem with time-series data; three datasets are used in this study. My deep neural network performs satisfactorily on two of the datasets. On the third dataset, however, the validation performance is highly problematic: the model improves during the first few epochs and then its performance keeps degrading. The following are screenshots:

[Screenshots: training and validation performance curves]

Regarding my Deep Architecture:

I use a hybrid model (LSTM, CNN, attention, etc.). I can't show the code here, but I'm confident it's not a code issue, because I reran the code of the baseline study and saw the same trend, even though that paper reports 90%+ performance. A rough sketch of this kind of architecture is given below.
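Since I can't share the real code, the following is only a minimal, hypothetical sketch of this kind of CNN–LSTM–attention hybrid; the sequence length, feature count, layer sizes, and number of classes are placeholders, not the actual configuration:

    import tensorflow as tf

    def build_hybrid_model(timesteps=128, n_features=9, n_classes=5):
        # Hypothetical shapes; the real dataset dimensions are not shown here.
        inputs = tf.keras.Input(shape=(timesteps, n_features))

        # Convolutional front-end to extract local temporal patterns
        x = tf.keras.layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(inputs)
        x = tf.keras.layers.MaxPooling1D(pool_size=2)(x)

        # Recurrent block with orthogonal recurrent initialization, as described below
        x = tf.keras.layers.LSTM(
            64,
            return_sequences=True,
            recurrent_initializer=tf.keras.initializers.Orthogonal(),
        )(x)

        # Simple dot-product self-attention over the LSTM outputs
        x = tf.keras.layers.Attention()([x, x])
        x = tf.keras.layers.GlobalAveragePooling1D()(x)

        # Fully connected head with ReLU and dropout
        x = tf.keras.layers.Dense(128, activation="relu")(x)
        x = tf.keras.layers.Dropout(0.5)(x)
        outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)

        return tf.keras.Model(inputs, outputs)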

Regarding hyperparameter tuning:

I suspect that finding the optimal hyperparameters may be part of the problem, but I have been experimenting with this model and this dataset for the past six months. I tried the original paper's dropout rates, optimizer, and activation function as-is, but I could not even achieve an 89% F1-score. In addition, I experimented with the following hyperparameter configurations:

  • Tried different optimizers (Adam, RMSprop, SGD) with a cross-entropy loss and dropout on the fully connected layers.

  • Experimented with batch sizes ranging from 10 to 500, with a learning rate of 0.001 and a decay rate of 10^-9 every ten epochs (exactly what the state-of-the-art papers used); a sketch of this optimizer and learning-rate schedule follows the preprocessing code below.

  • Regarding weight initialization: I used recurrent_initializer=tf.keras.initializers.Orthogonal() on the LSTM layers and the default TensorFlow Keras initializers for the rest of the model.

  • Regarding the activation function: I used ReLU on all model layers except the LSTM layers. Regarding preprocessing, I experimented with and without it but did not notice any significant impact. Here is the preprocessing code:

    import numpy as np

    def normalization(x_train, x_val, x_test):
        # Standardize using the training-set statistics only, so no information
        # from the validation or test sets leaks into the scaling.
        mean_train = np.mean(x_train, axis=0)
        std_train = np.std(x_train, axis=0)

        x_train = (x_train - mean_train) / std_train
        x_val = (x_val - mean_train) / std_train
        x_test = (x_test - mean_train) / std_train
        return x_train, x_val, x_test

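As referenced in the hyperparameter list above, here is a minimal sketch of the optimizer and step-wise learning-rate decay I described. The interpretation of the 10^-9 decay as a multiplicative step, the one-hot categorical cross-entropy loss, and the batch size shown are assumptions for illustration only:

    import tensorflow as tf

    def step_decay(epoch, lr):
        # A decay rate of 10^-9 applied every ten epochs; a multiplicative
        # step decay is assumed here as one possible interpretation.
        if epoch > 0 and epoch % 10 == 0:
            return lr * (1.0 - 1e-9)
        return lr

    model = build_hybrid_model()  # the hypothetical sketch defined earlier
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )

    # history = model.fit(
    #     x_train, y_train,
    #     validation_data=(x_val, y_val),
    #     epochs=100,          # placeholder
    #     batch_size=64,       # within the 10-500 range I tried
    #     callbacks=[tf.keras.callbacks.LearningRateScheduler(step_decay)],
    # )

Swapping tf.keras.optimizers.RMSprop or tf.keras.optimizers.SGD into the compile call reproduces the other optimizer variants I tried.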

My present performance is in the range of 83% to 88.5%, whereas my target is at least 90% or 91%. I have run my experiments over a thousand times with a variety of parameter settings, but I have been unable to meet my performance goal. Is there anything else I should try, or should I give up on this dataset?

Ahmad
  • Also relevant: How to know that your machine learning problem is hopeless? Your target may be 90%, but that does not mean this target is possible to achieve. – Stephan Kolassa Jul 17 '22 at 06:55
  • @StephanKolassa I have read that post multiple times and tried to fix things, but could not succeed. Moreover, I also wonder why my neural network works on the other two datasets but not on this one. – Ahmad Jul 17 '22 at 06:56
  • @StephanKolassa Regarding the evaluation metric: my datasets are imbalanced, and the state-of-the-art research uses the F1-score to report performance, so I need to follow that if I want to compare results. – Ahmad Jul 17 '22 at 06:59
  • I know that data scientists and computer scientists use F1 scores. It is still not a good choice. Probabilistic predictions, assessed using proper scoring rules, are superior - and there is zero concern with "imbalance". Unfortunately, data scientists and computer scientists lack the necessary statistical understanding. If you feel you need to use an inferior measure to stay within the research mainstream, that is of course an argument - but it doesn't make F1 any better. – Stephan Kolassa Jul 17 '22 at 07:03
  • @StephanKolassa Please don't close my question; I would like to see responses from other community members. Please wait at least 5 days before closing. Thanks. – Ahmad Jul 17 '22 at 07:09
  • I voted to close; it takes two more closure votes. If you believe the proposed duplicate is not one, I suggest you edit your question, ideally at the very top, and explain precisely why you believe so. (You can also edit your question later, which will put it in the reopen queue, but that is often not quite as effective.) I have also pinged our resident NN expert in case they want to take a look, but of course that confers no obligation on them. – Stephan Kolassa Jul 17 '22 at 07:18
  • Since the OP is not giving any weight to absolute calibration accuracy, the situation is probably worse than they suspect. F1 is a bad idea. – Frank Harrell Jul 17 '22 at 11:39
  • @Xi'an Sure. Can you suggest anyone? – Ahmad Jul 17 '22 at 13:16
  • @Xi'an Really sorry for the typos; I have now modified the title of my post, but I'm not sure if it's OK. – Ahmad Jul 17 '22 at 13:39
  • @FrankHarrell Thank you so much for your comment. I didn't completely understand your point; could you perhaps elaborate? – Ahmad Jul 17 '22 at 14:11
  • What Frank Harrell means is that you think the performance is poor because of an unsatisfactory F1 score, but your use of a problematic, threshold-based metric like F1 is likely hiding other issues, enough that the situation is even worse than you currently believe. For instance, how good is the probability calibration? How is the performance at thresholds other than the software default of $0.5$ that are likely to be more meaningful? – Dave Jul 17 '22 at 15:38
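To make the calibration and proper-scoring-rule point from the comments concrete, here is a minimal, hypothetical sketch of how validation predictions could be assessed with log loss, per-class Brier scores, and calibration curves using scikit-learn; the arrays y_val and p_val below are random placeholders, not real model output:

    import numpy as np
    from sklearn.metrics import log_loss, brier_score_loss
    from sklearn.calibration import calibration_curve

    # Placeholder data: y_val holds integer class labels, p_val holds predicted
    # class probabilities (shape: n_samples x n_classes).
    rng = np.random.default_rng(0)
    n_samples, n_classes = 1000, 5
    y_val = rng.integers(0, n_classes, size=n_samples)
    p_val = rng.dirichlet(np.ones(n_classes), size=n_samples)

    # Proper scoring rule: lower is better, no classification threshold required.
    print("log loss:", log_loss(y_val, p_val, labels=np.arange(n_classes)))

    # Brier score and calibration curves are defined for binary outcomes,
    # so assess one class at a time (one-vs-rest).
    for k in range(n_classes):
        y_bin = (y_val == k).astype(int)
        print(f"class {k} Brier score:", brier_score_loss(y_bin, p_val[:, k]))
        frac_pos, mean_pred = calibration_curve(y_bin, p_val[:, k], n_bins=10)
        # Plotting frac_pos against mean_pred shows the per-class calibration.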

0 Answers