
I have an imbalanced Google Analytics dataset. I'm interested in predicting totals.transactionRevenue, but of the 70,000 data points only 700 have transactions. The value of these transactions ranges from 0 to 20 and is roughly normally distributed. Below is a sample of the data table.

[Image: sample of the imbalanced dataset]

I've tried to remedy this by oversampling the samples with transactions, via simple duplication of the minority class (samples with revenue). This seems to have partly improved the model's performance, as it is now outputting non-zero transactions. I was wondering if there are any other methods I could look into? (Python-based solutions would be a bonus.)
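For reference, the duplication-based oversampling described above can be done in a couple of lines of pandas. This is a minimal sketch on a toy stand-in for the data (the column names and sizes are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the GA table: most rows have zero revenue.
df = pd.DataFrame({
    "pageviews": range(1000),
    "revenue": [5.0] * 10 + [0.0] * 990,
})

minority = df[df["revenue"] > 0]
majority = df[df["revenue"] == 0]

# Duplicate minority rows (sampling with replacement) until the two
# groups are the same size, then recombine.
oversampled = pd.concat(
    [majority, minority.sample(len(majority), replace=True, random_state=0)],
    ignore_index=True,
)
```

The `imbalanced-learn` package offers the same idea (`RandomOverSampler`) plus synthetic variants such as SMOTE, if you want something beyond plain duplication.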

Zizi96
  • The usual techniques are reviewed here: https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html. – Mark Pundurs Jan 26 '21 at 18:13
  • Thanks Mark, number 6 (Cluster the abundant class) seems especially interesting. I'll give it a whirl. – Zizi96 Jan 26 '21 at 23:05

1 Answer


The approach I would take would be to have two (or possibly three) outputs, one of which predicts the probability of there being a transaction and the other of which predicts the conditional mean (and perhaps conditional variance) of the value of the transaction should there be one. It is relatively straightforward to construct a likelihood for both tasks at the same time, and is an approach I have used successfully for modelling rainfall (based on the work of Peter Williams).

Peter M. Williams, "Modelling seasonality and trends in daily rainfall data", NIPS'97: Proceedings of the 10th International Conference on Neural Information Processing Systems, December 1997 Pages 985–991 (www)
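The two-output ("hurdle") approach can be sketched with off-the-shelf scikit-learn pieces: one model for the probability of a transaction, a second model for the conditional mean of revenue fitted only on the rows that have one, and the product of the two as the expected revenue. The synthetic data and choice of linear models here are illustrative assumptions, not the answer's exact method (which builds a joint likelihood):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)

# Synthetic features; roughly 1-6% of rows carry revenue, echoing the question.
X = rng.normal(size=(5000, 3))
has_txn = rng.random(5000) < 0.01 + 0.05 * (X[:, 0] > 1)
revenue = np.where(has_txn, np.clip(rng.normal(10, 4, 5000), 0, 20), 0.0)

# Stage 1: probability that a visit converts at all.
clf = LogisticRegression().fit(X, has_txn)

# Stage 2: conditional mean revenue, fitted only on converting visits.
reg = LinearRegression().fit(X[has_txn], revenue[has_txn])

# Expected revenue = P(transaction) * E[revenue | transaction].
expected = clf.predict_proba(X)[:, 1] * reg.predict(X)
```

Note that no resampling is needed: stage 1 sees the true class proportions, and stage 2 never sees the zero-revenue rows at all.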

That approach has the advantage of not needing any resampling, which is often of dubious value as it means the training set is no longer statistically representative of operational conditions and is likely to over-predict transaction value unless some other correction is made.

In this sort of situation it is vital to understand what performance criterion is important for operational use and why; otherwise it is easy to be led astray pursuing performance statistics that don't measure what is really important, and to end up with a system that looks good on paper but doesn't actually do anything useful. This is very common with imbalanced classification problems, where optimal accuracy is often achieved by ignoring the minority class. If that is not acceptable, it means false negatives are a "worse" error than false positives, which in turn means accuracy is not the right performance metric and you should be using something like expected loss (effectively a cost-weighted accuracy). The imbalance isn't the problem; the problem is using a performance metric that doesn't measure what you really want to measure.
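To make the expected-loss point concrete, here is a small sketch. The 10:1 cost ratio is a hypothetical assumption purely for illustration; in practice it would come from the business cost of missing a transaction versus raising a false alarm:

```python
import numpy as np

# Hypothetical costs: assume a missed transaction (false negative) costs
# 10x a false alarm (false positive).
COST_FN, COST_FP = 10.0, 1.0

def expected_loss(y_true, y_pred):
    """Cost-weighted error rate, a stand-in for plain accuracy."""
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return (COST_FN * fn + COST_FP * fp) / len(y_true)

y_true      = np.array([1, 0, 0, 0, 1, 0])
always_zero = np.zeros(6, dtype=int)        # ignores the minority class
eager       = np.array([1, 1, 1, 0, 1, 1])  # over-predicts transactions

# always_zero wins on plain accuracy (4/6 vs 3/6) but loses badly on
# expected loss: 2 costly misses outweigh 3 cheap false alarms.
```

Here the degenerate "predict nothing" classifier looks better by accuracy and worse by expected loss, which is exactly the trap the answer describes.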

Dikran Marsupial