Am I allowed to average the list of precision and recall after k-fold cross validation?

Question

I have created a 5-fold cross-validation model and used the cross_val_score function to calculate the precision and recall of the cross-validated model as follows:

from sklearn import cross_validation  # removed in scikit-learn 0.20; newer releases use sklearn.model_selection

def print_accuracy_report(classifier, X, y, num_validations=5):
    # Mean precision over the folds
    precision = cross_validation.cross_val_score(classifier,
            X, y, scoring='precision', cv=num_validations)
    print "Precision: " + str(round(100*precision.mean(), 2)) + "%"

    # Mean recall over the folds
    recall = cross_validation.cross_val_score(classifier,
            X, y, scoring='recall', cv=num_validations)
    print "Recall: " + str(round(100*recall.mean(), 2)) + "%"

I wonder if I'm allowed to do these lines:

    print "Precision: " + str(round(100*precision.mean(), 2)) + "%"
    print "Recall: " + str(round(100*recall.mean(), 2)) + "%"

I mean, do precision.mean() and recall.mean() represent the precision and recall of the whole model?

Just for comparison's sake, in the scikit-learn documentation I've seen the model's accuracy calculated as:

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)


print(scores)

array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])

    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)

Nuclear Hoagie · Accepted Answer · 2019-01-24T16:36:39.717

11

First of all, when you do 5-fold cross validation, you don't have one model, you have five. So it's not really correct to talk about the precision/recall of the "whole model" since there isn't just one. Rather, you're getting an estimate of the precision/recall from your model-building process.

That said, each fold is a model with its own precision and recall, and you can average them to get a mean performance metric over all your folds. One thing to note, though, is that since recall is the proportion of true positives out of all actually positive samples, you'll have to weight each fold by its number of positives.

Imagine a case where you have 4 folds that each have only one positive, which is correctly identified, giving you 100% recall on those folds. The fifth fold has 96 positives, 46 of which are correctly identified, for 48% recall. A straight mean would give you a recall of about 90%, but if you account for the greater number of positives in the fifth fold, your overall recall rate is only 50% (50 of 100 positives identified).

If your folds are well-stratified, this problem would fix itself for recall. For precision, though, which depends on the number of predicted positives in each fold, I don't see any way to stratify before doing prediction (you'd have to know the prediction output before defining the folds and training the model). I would implement the weighted average method, as it will work for any metric you choose to compute, and will work in cases where perfectly equal stratification isn't possible (when N isn't evenly divisible by K).
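The arithmetic in the recall example can be checked with a few lines of plain Python. This is only a sketch of the numbers quoted above; the variable names are made up for illustration.

```python
# Per-fold counts from the example: four folds with 1 of 1 positives found,
# and a fifth fold with 46 of 96 positives found.
true_positives = [1, 1, 1, 1, 46]
actual_positives = [1, 1, 1, 1, 96]

# Recall of each fold taken on its own.
fold_recalls = [tp / n for tp, n in zip(true_positives, actual_positives)]

# Straight (unweighted) mean: every fold counts equally.
straight_mean = sum(fold_recalls) / len(fold_recalls)

# Weighted mean: each fold's recall is weighted by its number of actual positives.
weighted_mean = (
    sum(r * n for r, n in zip(fold_recalls, actual_positives))
    / sum(actual_positives)
)

print(f"straight mean recall: {straight_mean:.3f}")
print(f"weighted mean recall: {weighted_mean:.3f}")
```

The straight mean comes out near 0.90 while the weighted mean is 0.50, matching the figures in the example.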

Another approach suggested in the comments, which would be equivalent to weighted averages of summary metrics, is to sum the prediction confusion matrices from each fold and compute summary statistics from the combined matrix. By summing the TP, TN, FP, and FN from all folds and then computing precision/recall, you are implicitly accounting for any differences in prevalence of positive cases or positive predictions across folds.
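A minimal sketch of the combined-matrix approach, in plain Python. The TP and FN counts follow the recall example in this answer; the FP counts are invented purely for illustration, since the example does not give any. The last step checks that pooling gives the same precision as weighting each fold's precision by its number of predicted positives.

```python
# Per-fold confusion counts as (TP, FP, FN).
# TP and FN come from the recall example; the FP values are hypothetical.
folds = [
    (1, 0, 0),
    (1, 1, 0),
    (1, 0, 0),
    (1, 3, 0),
    (46, 4, 50),
]

# Sum each count over all folds to get one combined confusion matrix.
tp = sum(f[0] for f in folds)
fp = sum(f[1] for f in folds)
fn = sum(f[2] for f in folds)

# Metrics computed once, from the pooled counts.
pooled_precision = tp / (tp + fp)
pooled_recall = tp / (tp + fn)

# Same precision via a weighted average: weight = predicted positives per fold.
predicted_positives = [f[0] + f[1] for f in folds]
fold_precisions = [f[0] / n for f, n in zip(folds, predicted_positives)]
weighted_precision = (
    sum(p * n for p, n in zip(fold_precisions, predicted_positives))
    / sum(predicted_positives)
)

print(f"pooled precision:   {pooled_precision:.3f}")
print(f"weighted precision: {weighted_precision:.3f}")
print(f"pooled recall:      {pooled_recall:.3f}")
```

Note that a fold with no predicted positives would have an undefined per-fold precision, which the pooled calculation avoids.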

edited Jan 24 '19 at 16:36

answered Jan 24 '19 at 14:59

Thanks. I hadn't noticed the proportion of positives in each fold. So: 1) From your answer I conclude that, once the folds are stratified, the .mean() operation is valid. Right? 2) How can I make them stratified as you said? Should I use precision_weighted and recall_weighted for the scoring parameter? Should I use StratifiedKFold? – hyTuev Jan 24 '19 at 15:09
The mean operation should work for recall if the folds are stratified, but I don't see a simple way to stratify for precision, which depends on the number of predicted positives (see updated answer). Not too familiar with the scikit-learn functions, but I'd bet there is one to automatically stratify folds by class. To do it manually, you could separate all your samples by class, and then put a random 20% of each list into one fold - that will give you 5 folds which each have consistent class proportions. – Nuclear Hoagie Jan 24 '19 at 15:22
I think another way you could avoid the issue Nuclear describes is to simply calculate one measure of precision/recall from all of the 5 runs combined. Per their example, if you have 1 TP in the first 4 folds and 46 TP in the 5th then adding this up you have 50 TP. You have also 50 FN so you get the 50% recall rate they describe. – astel Jan 24 '19 at 16:18
@astel Yeah, I understand what you are saying. Just curious: do you know whether scikit-learn has implemented this method? – hyTuev Jan 24 '19 at 19:36
I don't use scikit-learn so I can't comment on that, though my guess would be no. Assuming you can retrieve the true/predicted values for each record in each fold though it shouldn't be too hard to program something that does that. – astel Jan 24 '19 at 19:55
"that since recall is the proportion of true positives out of all positives" that would be the precision, wouldn't it? Recall is TP/(TP+FN). – gented May 06 '20 at 09:30
@gented I meant that recall is the proportion of TP out of all actually positive samples, rather than the proportion of TP out of all predicted positive samples, which is precision as you point out. – Nuclear Hoagie May 06 '20 at 12:59
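On the scikit-learn questions raised in the comments, here is one possible sketch rather than a definitive recipe. It assumes a reasonably recent scikit-learn, uses synthetic data from make_classification and a LogisticRegression classifier as stand-ins for the asker's X, y and classifier, and combines StratifiedKFold (folds with consistent class proportions) with cross_val_predict (one out-of-fold prediction per sample) so that precision and recall are computed once over all folds combined. As far as I know, the precision_weighted and recall_weighted scorers weight across classes, not across folds, so they address a different question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced binary data (an assumption, standing in for the real X and y).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# StratifiedKFold keeps the class proportions roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each sample is predicted exactly once, by the model that was not trained on it.
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Combined confusion matrix over all folds, and metrics computed from it.
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
pooled_precision = precision_score(y, y_pred)
pooled_recall = recall_score(y, y_pred)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"pooled precision: {pooled_precision:.3f}")
print(f"pooled recall:    {pooled_recall:.3f}")
```

When cv is given as an integer and the estimator is a classifier, cross_val_score already uses stratified folds by default, so passing a StratifiedKFold object mainly makes the choice explicit and allows shuffling.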
