
Hey there,

I have a question about a topic that's been discussed many times before, but to which I could not find a satisfying answer.

I'm working with a self-generated dataset that comprises only 340 data points. The dataset is not balanced, for experimental reasons: it consists of eleven classes that range from 60 events down to 7 events per class. Because of the data's origin, I cannot use common augmentation algorithms, so we developed our own. I trained a model on these data; it performs quite well and generalizes the problem satisfactorily. I also tested different amounts of augmentation, based on the resulting performance of the model.

My question now: Is it a good idea to use data augmentation to balance out the dataset, even though the unbalanced dataset already produces a well-performing model? Or do I actually not need a balanced dataset, as long as my model performs to my satisfaction?

My concern is the integrity of my model. It is to be published as part of a larger project and I just want to make sure that it stands up to the review process.

I welcome any ideas and feedback on this topic.

  • As my mathematical understanding of this problem is very limited, it only helps to a certain point. I get your argument that an unbalanced dataset is not a problem in itself, but only becomes one if accuracy is used as the metric. After reading some of your posts, I also understand that you are not a big fan of accuracy. I feel a bit lost in the jungle of different metrics. Could you perhaps suggest a more suitable metric for such a multiclass classification model, one that also works with an unbalanced dataset? – TheoBoveri Oct 21 '21 at 06:15
  • I would usually recommend going with a probabilistic classifier, i.e., one that for each instance gives a predicted probability of it belonging to class A, B, C or D (with the predicted probabilities summing to 1). I would argue that the decision on what to do with this classification is a separate issue and should be informed by the costs of possibly wrong actions: even if there is only a small predicted probability of a malignant cancer, we would want to run additional tests rather than treat the patient as "healthy" simply because P(healthy) = 0.60. https://stats.stackexchange.com/a/312124/1352 – Stephan Kolassa Oct 21 '21 at 07:39
  • You can assess the quality of probabilistic classifications using proper scoring rules. The tag wiki contains information and references. Note that many scoring rules are formulated only for binary classifications, but many work just as well in multi-class situations (see the sketch after these comments). This thread compares the log and the Brier score, with some specific emphasis on multi-class classifications. Good luck! – Stephan Kolassa Oct 21 '21 at 07:42
  • My model actually trains with categorical cross-entropy as its loss function. As I read elsewhere, the log loss score is also referred to as cross-entropy, right? Also, the output of my network runs through a softmax function, resulting in scores in [0,1] (a minimal sketch of this kind of setup follows after these comments). So do these circumstances fit your recommendation above to use a probabilistic classifier, or am I still missing something? – TheoBoveri Oct 21 '21 at 08:55
  • That does sound promising! Then my recommendation would simply be to work directly with the scores in $[0,1]$ and not check them against some threshold. – Stephan Kolassa Oct 21 '21 at 09:02
  • I am already working directly with the [0,1] scores, so I assume it's fine for me to work with an unbalanced dataset :) That's fantastic news. Thanks a lot for your help. – TheoBoveri Oct 21 '21 at 09:40
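
For illustration, here is a minimal NumPy sketch of the two proper scoring rules mentioned above, the log score and the (multi-class) Brier score. The predicted probabilities and labels are made-up placeholders, not values from the actual dataset:

```python
# Minimal sketch of two multi-class proper scoring rules (lower is better).
import numpy as np

def log_score(y_true, y_prob, eps=1e-15):
    """Mean negative log predicted probability of the true class."""
    p = np.clip(y_prob[np.arange(len(y_true)), y_true], eps, 1.0)
    return -np.mean(np.log(p))

def brier_score(y_true, y_prob):
    """Mean squared distance between the predicted probability vector
    and the one-hot encoding of the true class."""
    onehot = np.eye(y_prob.shape[1])[y_true]
    return np.mean(np.sum((y_prob - onehot) ** 2, axis=1))

# Three made-up instances over four classes; each row sums to 1.
y_prob = np.array([[0.70, 0.10, 0.10, 0.10],
                   [0.20, 0.50, 0.20, 0.10],
                   [0.05, 0.05, 0.10, 0.80]])
y_true = np.array([0, 1, 3])

print(log_score(y_true, y_prob))    # ~0.42
print(brier_score(y_true, y_prob))  # ~0.17
```

Both scores are computed from the full predicted probability vectors, so no threshold and no balancing of the dataset is needed to evaluate the classifier.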
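
And here is a minimal sketch of the kind of setup described in the last few comments: a softmax output trained with categorical cross-entropy, whose predicted probabilities are used directly. It assumes TensorFlow/Keras (suggested, but not stated, by the loss name); the feature size, layer widths, and inputs are hypothetical placeholders:

```python
# Minimal sketch, assuming TensorFlow/Keras; shapes are placeholders.
import numpy as np
import tensorflow as tf

n_classes = 11  # eleven classes, as in the question

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),                      # placeholder feature size
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_classes, activation="softmax"),  # probabilities in [0, 1]
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy")  # i.e. the multi-class log loss

# Work with the predicted probability vectors directly, rather than
# collapsing them to a single class label via argmax or a threshold:
x = np.random.rand(5, 64).astype("float32")  # made-up inputs
probs = model.predict(x)                     # shape (5, 11); each row sums to 1
```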

0 Answers