I have the following F1 score function that I use both as a metric while training the model and during prediction:
import tensorflow as tf
from tensorflow.keras import backend as K

# This is to calculate F1
def f1(y_true, y_pred):
    def recall(y_true, y_pred):
        """Recall metric.

        Only computes a batch-wise average of recall.
        Computes the recall, a metric for multi-label classification of
        how many relevant items are selected.
        """
        print(y_pred)
        y_pred = y_pred.ravel() < 0.5  # mark outputs below 0.5 as positives (booleans)
        y_pred = tf.cast(y_pred, tf.float32, name=None)
        print(y_pred)
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        true_positives = tf.cast(true_positives, tf.float32, name=None)
        possible_positives = tf.cast(possible_positives, tf.float32, name=None)
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        """Precision metric.

        Only computes a batch-wise average of precision.
        Computes the precision, a metric for multi-label classification of
        how many selected items are relevant.
        """
        print(y_pred)
        y_pred = y_pred.ravel() < 0.5  # mark outputs below 0.5 as positives (booleans)
        y_pred = tf.cast(y_pred, tf.float32, name=None)
        print(y_pred)
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        true_positives = tf.cast(true_positives, tf.float32, name=None)
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        predicted_positives = tf.cast(predicted_positives, tf.float32, name=None)
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision

    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall + K.epsilon())) / 100
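(Side note: to sanity-check the metric outside of training, a minimal sketch like the one below can be used; the toy labels/predictions are made up for illustration and assume TF 2 eager execution with float32 NumPy arrays, like the output of model.predict.)

import numpy as np

# Hypothetical toy batch: labels and raw model outputs (float32 to match the casts in f1).
labels_toy = np.array([1., 0., 1., 0.], dtype=np.float32)
preds_toy = np.array([0.2, 0.9, 0.1, 0.7], dtype=np.float32)  # thresholded with < 0.5 inside f1

# Calling the metric directly returns an EagerTensor; with these values precision = recall = 1,
# so this prints roughly 0.01 because of the final division by 100.
print(f1(labels_toy, preds_toy).numpy())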
I compile the model as follows:
model.compile(loss=ContrastivLoss(margin=1), optimizer=rms, metrics=["accuracy", f1])
Output from fitting the model on the training and validation datasets, shown for the last epoch (with early stopping enabled):
Epoch 21/100
447/447 [==============================] - 1s 3ms/step - loss: 0.1646 - accuracy: 0.2271 - f1: 0.3198 - val_loss: 0.1963 - val_accuracy: 0.2695 - val_f1: 0.5232
Although I use a validation dataset during training, I also held out some data for testing:
loss = model.evaluate(x=[test_data[:,0],test_data[:,1]], y=labels_test)
y_pred_train = model.predict([train_data[:,0], train_data[:,1]])
train_f1= f1(labels_train, y_pred_train)
train_accuracy = accuracy(labels_train, y_pred_train)
y_pred_test = model.predict([test_data[:,0], test_data[:,1]])
test_f1 = f1(labels_test, y_pred_test)
test_accuracy = accuracy(labels_test, y_pred_test)
print("Loss = {}, Train F1 = {} Test F1 = {}".format(loss, train_f1, test_f1))
print("Loss = {}, Train Accuracy = {} Test Accuracy = {}".format(loss, train_accuracy, test_accuracy))
Edit: Here is the output. The training and testing F1 scores computed in the prediction phase are very low compared to the ones reported during training and validation in model fitting above:
Loss = [0.19634802639484406, 0.2694787085056305, 0.26106637716293335], Train F1 = 0.008032719604671001 Test F1 = 0.008442788384854794
Loss = [0.19634802639484406, 0.2694787085056305, 0.26106637716293335], Train Accuracy = 0.7953781512605042 Test Accuracy = 0.7305213004484304
Is it because the f1 I used on the test data divides by 100? But the same division is applied during training and validation:
return 2 * ((precision * recall) / (precision + recall + K.epsilon())) / 100
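As a scale check with toy numbers (not from the model), the trailing /100 alone shrinks an otherwise ordinary F1 value like this:

precision, recall = 0.8, 0.8                               # hypothetical values, for illustration only
f1_raw = 2 * (precision * recall) / (precision + recall)   # = 0.8
print(f1_raw, f1_raw / 100)                                # 0.8 vs 0.008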
Edit: It was suggested that this could be due to overfitting. However, the validation loss reported above does not show a clear sign of overfitting, so I am not sure what you would suggest, please.

train_f1: 0.3198 -- val_f1: 0.5232 are very different from the F1 scores I got in the model prediction phase (test_f1: 0.0084 -- train_f1: 0.0080), aren't they? I used the same F1 score function for both training/validation and testing, as I posted above. – Avv Feb 01 '23 at 16:15