Splitting medical dataset by patient

Question

I am currently trying to train a CNN model to classify CT-scans.

I split the dataset using K-fold cross-validation and, since the dataset contains multiple slices per patient, I split it by patient ID.

The problem is that, since the number of slices per patient varies, splitting by patient creates folds that are not balanced.

How can I deal with this problem? Is removing images so that all patients have the same number of images a good idea?

Thank you in advance.

Does each slice have a classification/regression label? – gunes Jun 21 '22 at 07:24
Yes, each slice has a classification label – Simos Ps Jun 21 '22 at 11:50

score 1 · Answer 1 · answered Jun 22 '22 at 12:01

There is no problem whatsoever with unbalanced test folds (other than that you need to think about how to properly aggregate the results, at scan vs. patient level, but that is a consequence of the data structure having patients and slices rather than one of class imbalance).
Since each training set consists of k - 1 of the k folds, i.e. (k - 1)/k of the data, the training set balance is much less affected.
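To illustrate the (k - 1)/k argument, here is a minimal sketch that assigns whole patients to folds and compares how much the test fold sizes and the training set sizes vary. It takes "balance" to mean the number of slices per fold; the patient IDs, slice counts, and the helper names `patient_folds` and `relative_spread` are made up for illustration and are not from the question.

```python
import random

def patient_folds(slices_per_patient, k, seed=0):
    """Assign whole patients to k folds so no patient spans two folds."""
    patients = sorted(slices_per_patient)
    random.Random(seed).shuffle(patients)
    folds = [[] for _ in range(k)]
    for i, pid in enumerate(patients):
        folds[i % k].append(pid)  # round-robin over shuffled patients
    return folds

def relative_spread(sizes):
    """(max - min) divided by the mean size."""
    return (max(sizes) - min(sizes)) / (sum(sizes) / len(sizes))

# Made-up slice counts per patient, for illustration only.
slices_per_patient = {
    "p01": 120, "p02": 35, "p03": 80, "p04": 210, "p05": 60,
    "p06": 95, "p07": 40, "p08": 150, "p09": 70, "p10": 55,
}
k = 5
folds = patient_folds(slices_per_patient, k)

# Slices in each test fold, and in the matching training set (all other folds).
test_sizes = [sum(slices_per_patient[p] for p in fold) for fold in folds]
total = sum(test_sizes)
train_sizes = [total - t for t in test_sizes]

print("test fold sizes: ", test_sizes)
print("train set sizes: ", train_sizes)
print("relative spread (test): ", round(relative_spread(test_sizes), 3))
print("relative spread (train):", round(relative_spread(train_sizes), 3))
```

The absolute difference between the largest and smallest set is the same for test and training sets, but the training sets are k - 1 times larger on average, so their relative spread is the test spread divided by k - 1.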
Last but not least, if that amount of imbalance causes problems, I'd say that this is a symptom of important underlying problems with the stability/ruggedness of your modeling approach. (see also What is the root cause of the class imbalance problem?)
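The aggregation question mentioned above (scan vs. patient level) can be sketched as follows. This is only an illustration with made-up predictions: `slice_accuracy` counts every slice equally, while `patient_accuracy` first averages within each patient so that patients with many slices do not dominate. Both function names and the toy records are hypothetical, and accuracy is used only as an example metric.

```python
from collections import defaultdict

def slice_accuracy(records):
    """Fraction of correctly classified slices; every slice counts equally."""
    return sum(pred == label for _, label, pred in records) / len(records)

def patient_accuracy(records):
    """Mean of per-patient accuracies; every patient counts equally."""
    hits = defaultdict(list)
    for patient, label, pred in records:
        hits[patient].append(pred == label)
    per_patient = [sum(h) / len(h) for h in hits.values()]
    return sum(per_patient) / len(per_patient)

# Made-up test-fold results as (patient ID, true label, predicted label).
records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 1, 1),  # 4 slices, all correct
    ("B", 0, 1),                                          # 1 slice, wrong
]

print("slice-level accuracy:  ", slice_accuracy(records))
print("patient-level accuracy:", patient_accuracy(records))
```

With unequal numbers of slices per patient the two figures can differ noticeably, so it is worth deciding up front which level the reported performance should refer to.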
