
I am trying to predict the processing time of a process using the XGBoost regression algorithm in Python. However, I realised that my sample data is skewed: most observations are short processing times, and my model struggles to predict the longer process times in the right tail of the histogram. I want to oversample the values that are currently under-represented.

[Image: histogram of process times]

I have been looking at the SMOTE algorithm for oversampling, but all the examples I could find work with categorical y values. My y values are continuous, and I want to generate similarly continuous values. My predictor variables contain both continuous and binary values.

[Image: predictor variable data]

[Image: output data]

I want to oversample by creating new process times while keeping the predictor variables constant. Do you have any suggestions for oversampling such data? Thanks.

1 Answer


Oversampling will mainly bias your predictions. This thread looks at oversampling in the context of "unbalanced" classification, but it applies to your situation, too.

Your problem appears to be that some processes run longer not necessarily systematically, but because of residual variation. (Variability may also depend on predictors.) If it is important to you to be prepared for longer processing times, then you should not predict mean process times, but "worst cases", i.e., do quantile predictions and predict (say) a number such that 95% of your processes will take less time. This will give you a safety cushion. You can train your model for quantiles using the pinball loss. See also our tag.
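A minimal sketch of what such a quantile prediction could look like, assuming XGBoost >= 2.0 (which offers the built-in pinball-loss objective "reg:quantileerror" with a quantile_alpha parameter); the feature matrix X and the process times y below are synthetic stand-ins for your data, not part of your setup:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # stand-in for your predictors
y = rng.gamma(shape=2.0, scale=3.0, size=1000)  # right-tailed "process times"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# objective="reg:quantileerror" optimizes the pinball (quantile) loss;
# quantile_alpha=0.95 targets the 95th percentile of the process time.
model = xgb.XGBRegressor(
    objective="reg:quantileerror",
    quantile_alpha=0.95,
    n_estimators=300,
    learning_rate=0.05,
)
model.fit(X_train, y_train)

pred_95 = model.predict(X_test)
# Roughly 95% of the observed times should fall below the predicted quantile.
print("empirical coverage:", np.mean(y_test <= pred_95))
```

On older XGBoost versions you could supply a custom pinball-loss objective instead, or use scikit-learn's GradientBoostingRegressor(loss="quantile", alpha=0.95) as a drop-in alternative.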

Stephan Kolassa