
I fitted 3 different models on a 5-class imbalanced dataset. The results show the model accuracy always being equal to the recall. How can this be possible?

1. RF model results:

Test acc:   0.6285670349948376
Recall:     0.6285670349948376
Precision:  0.6171361174985392
f1_score:   0.5886671088640658
ROC AUC score:  0.7998931710957794

2. MLP model results:

Accuracy:   0.44232332330133345
Recall:     0.44232332330133345
f1_score:   0.4242650817694506
Precision:  0.4707025922895617
ROC AUC score:  0.6031862642540948

3. CNN model results:

Accuracy:   0.7411148092888021
Recall:     0.7411148092888021
f1_score:   0.741477630295568
Precision:  0.7972578281551425
ROC AUC score:  0.8291519390873785

Models' confusion matrices:

1. RF model
[[ 8753    87   494  5183    84]
 [  344   449    26   578     1]
 [ 1429    33  1311  5504    40]
 [ 1431   104   668 18072    26]
 [  350     0    11   515    28]]
2. MLP model:

[[11106   574   677  1698   546]
 [  904   172   106   180    36]
 [ 4897   657   530  2133   100]
 [ 7668  2448  1532  8301   352]
 [  490    36    33   319    26]]

3. CNN model:

[[6195   28  137  226   52]
 [ 108  789   39   16    6]
 [  95    5 3113  376   10]
 [2506  326 2398 8570  238]
 [  72   10   73   46  705]]

In all cases, accuracy=recall! How can this be possible?

EDIT

Metrics calculation:

1. RF model:
pred_test = model.predict(x_test)
test_acc = sklearn.metrics.accuracy_score(y_test, pred_test)
f1 = sklearn.metrics.f1_score(y_test, pred_test, average='weighted')
recall = sklearn.metrics.recall_score(y_test, pred_test, average='weighted')
precision = sklearn.metrics.precision_score(y_test, pred_test, average='weighted')
pred_prob = model.predict_proba(x_test)
roc = roc_auc_score(y_test, pred_prob, average='weighted', 
    multi_class='ovr',labels=[0,1,2,3,4])
2. MLP:

accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
f1 = sklearn.metrics.f1_score(y_test, y_pred, average='weighted')
recall = sklearn.metrics.recall_score(y_test, y_pred, average='weighted')
precision = sklearn.metrics.precision_score(y_test, y_pred, average='weighted')

3. CNN:

Pred = model.predict(x_test, batch_size=32)
Pred_Label = np.argmax(Pred, axis=1)
labels = [0, 1, 2, 3, 4]
...
ConfusionM = confusion_matrix(list(y_test_ori), Pred_Label, labels=labels)
class_report = classification_report(list(y_test_ori), Pred_Label, labels=labels)
roc = roc_auc_score(y_test_ori, Pred, average='weighted', multi_class='ovr', labels=labels)
print(f" ROC score: {roc}")

super_ask

1 Answer


In this blog post you can find a review of those metrics; it also covers the weighted metrics that you use. If you look closely, accuracy and weighted recall are equal in their example as well, just as in your case. They will always be equal by definition, as you will see below.
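As a quick sanity check before the derivation (a minimal sketch on randomly generated labels, not your data or models), you can verify the equality directly with scikit-learn:

import numpy as np
from sklearn import metrics

rng = np.random.default_rng(0)
yt = rng.integers(0, 5, size=1000)  # arbitrary true labels for 5 classes
yp = rng.integers(0, 5, size=1000)  # arbitrary predicted labels

acc = metrics.accuracy_score(yt, yp)
rec = metrics.recall_score(yt, yp, average='weighted')
print(np.isclose(acc, rec))
# True (the two quantities coincide, up to floating point)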

Let me use the data example from the blog post linked above.

import numpy as np
from sklearn import metrics

# Constants
C = "Cat"
F = "Fish"
H = "Hen"

# True values
y_true = [C,C,C,C,C,C, F,F,F,F,F,F,F,F,F,F, H,H,H,H,H,H,H,H,H]

# Predicted values
y_pred = [C,C,C,C,H,F, C,C,C,C,C,C,H,H,F,F, C,C,C,H,H,H,H,H,H]

# note: C is reused here to hold the confusion matrix
C = metrics.confusion_matrix(y_true, y_pred)
print(C)

print(metrics.classification_report(y_true, y_pred, digits=3))

This prints the following:

[[4 1 1]
 [6 2 2]
 [3 0 6]]
              precision    recall  f1-score   support

         Cat      0.308     0.667     0.421         6
        Fish      0.667     0.200     0.308        10
         Hen      0.667     0.667     0.667         9

    accuracy                          0.480        25
   macro avg      0.547     0.511     0.465        25
weighted avg      0.581     0.480     0.464        25

Now, let's calculate the quantities by hand. First, notice that in the confusion matrix C the true labels are in rows and the predicted ones in columns. Accuracy is simple: the true positive counts sit on the diagonal, so we divide their sum by the total number of samples:

np.sum(np.diag(C)) / np.sum(C)
# 0.48

Recall that recall is defined as the ratio between the true positives and the total number of true instances of the class (the class size), i.e.

np.diag(C) / np.sum(C, axis=1)
# array([0.66666667, 0.2       , 0.66666667])
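As a cross-check (not part of the original example), scikit-learn's per-class recall on the same y_true and y_pred gives the same array:

metrics.recall_score(y_true, y_pred, average=None, labels=["Cat", "Fish", "Hen"])
# array([0.66666667, 0.2       , 0.66666667])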

If you look at scikit-learn's documentation, weighted recall is a weighted average of the per-class recall scores, with each class weighted by its size, i.e. you calculate something like this:

np.sum(np.diag(C) / np.sum(C, axis=1) * np.sum(C, axis=1)) / np.sum(C, axis=1).sum()
# 0.48

Did you notice something fancy about the calculation? There is this part, / np.sum(C, axis=1) * np.sum(C, axis=1), that cancels out: we divide by the class sizes to calculate the per-class recalls, then multiply by the same class sizes to weight the results. Also, np.sum(C, axis=1).sum() is simply np.sum(C), so we can simplify and rewrite the whole thing to

np.sum(np.diag(C)) / np.sum(C)
# 0.48

Does it look familiar? This is accuracy.
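And indeed, as a quick cross-check on the toy data above (again not part of the original calculation), scikit-learn reports the same number for the weighted recall and the accuracy:

metrics.recall_score(y_true, y_pred, average='weighted')
# 0.48
metrics.accuracy_score(y_true, y_pred)
# 0.48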

TL;DR As mentioned in the linked blog post, using micro-F1, micro-precision, and micro-recall does not make much sense, since they are all equal to accuracy. The same applies to weighting recall by the class size: it is just an unnecessarily complicated way of calculating accuracy.
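The same collapse is easy to see on the toy data (a small check using the y_true and y_pred above): all three micro-averaged metrics come out as 0.48, i.e. the accuracy.

metrics.precision_score(y_true, y_pred, average='micro')
# 0.48
metrics.recall_score(y_true, y_pred, average='micro')
# 0.48
metrics.f1_score(y_true, y_pred, average='micro')
# 0.48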

Tim