Cross-validation with metrics such as F1 can be implemented in two ways:
Method 1:
- For each cross-validation split, calculate F1_split on the validation dataset.
- F1_result = average_over_splits(F1_split)

Method 2:
- For each cross-validation split, calculate confusion_matrix_split on the validation dataset.
- confusion_matrix_result = sum_over_splits(confusion_matrix_split)
- Calculate F1 from confusion_matrix_result.
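For concreteness, here is a minimal sketch of both methods in Python. The classifier, the synthetic dataset, and the 5-fold setup are purely illustrative assumptions, not part of the question itself:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold

# Illustrative data and model (assumptions for the sketch only)
X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

per_split_f1 = []
pooled_cm = np.zeros((2, 2), dtype=int)

for train_idx, val_idx in cv.split(X, y):
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[val_idx])

    # Method 1: F1 on each validation split, averaged afterwards
    per_split_f1.append(f1_score(y[val_idx], y_pred))

    # Method 2: accumulate the confusion matrix over splits
    pooled_cm += confusion_matrix(y[val_idx], y_pred, labels=[0, 1])

f1_method1 = np.mean(per_split_f1)

tn, fp, fn, tp = pooled_cm.ravel()
f1_method2 = 2 * tp / (2 * tp + fp + fn)  # F1 from the pooled counts

print(f"Method 1 (averaged per-split F1): {f1_method1:.4f}")
print(f"Method 2 (F1 from pooled matrix): {f1_method2:.4f}")
```

The two numbers generally differ, which is exactly what the question is about.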
The second method is the only one possible with Leave-One-Out cross-validation: each validation fold contains a single sample, so F1 cannot be meaningfully computed per split.
Which method is preferable for k-fold cross-validation? Does the answer depend on k?
Links to theoretical research papers are welcome.
Update:
Let me reformulate the question:
If we compute a score based on the confusion matrix, is it preferable to:
- Calculate the score for each split separately and average it over the splits,
- Calculate a summary (pooled) confusion matrix over the splits and compute the score from it, or
- Not use the confusion matrix at all (and if so, what should be used instead)?