
I am trying to predict the processing time of a process using the XGBoost regression algorithm in Python. However, I realised that my sample data is skewed: most observations are short processing times, and my model struggles to predict the longer process times in the right tail of the histogram. I want to oversample the values that are currently under-represented.

[Image: histogram of process times]

I have been looking at the SMOTE algorithm for oversampling, but all the examples I could find work with categorical y values. My y values are continuous, and I want to generate similarly continuous values. My predictor variables contain both continuous and binary values.

[Image: predictor variable data]

[Image: output data]

I want to oversample by creating new process times while keeping the predictor variables constant. Do you have any suggestions for oversampling such data? Thanks.

1 Answer


Oversampling will mainly bias your predictions. This thread looks at oversampling in the context of "unbalanced" classification, but it applies to your situation, too.

Your problem appears to be that some processes run longer not necessarily systematically, but because of residual variation. (Variability may also depend on predictors.) If it is important to you to be prepared for longer processing times, then you should not predict mean process times, but "worst cases", i.e., do quantile predictions and predict (say) a number such that 95% of your processes will take less time. This will give you a safety cushion. You can train your model for quantiles using the pinball loss. See also our tag.
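A minimal sketch of what such a quantile prediction could look like, assuming XGBoost >= 2.0 (which offers the built-in pinball-loss objective "reg:quantileerror" with a quantile_alpha parameter); the feature matrix X and the process times y below are synthetic stand-ins for your data, not part of your setup:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # stand-in for your predictors
y = rng.gamma(shape=2.0, scale=3.0, size=1000)  # right-tailed "process times"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# objective="reg:quantileerror" optimizes the pinball (quantile) loss;
# quantile_alpha=0.95 targets the 95th percentile of the process time.
model = xgb.XGBRegressor(
    objective="reg:quantileerror",
    quantile_alpha=0.95,
    n_estimators=300,
    learning_rate=0.05,
)
model.fit(X_train, y_train)

pred_95 = model.predict(X_test)
# Roughly 95% of the observed times should fall below the predicted quantile.
print("empirical coverage:", np.mean(y_test <= pred_95))
```

On older XGBoost versions you could supply a custom pinball-loss objective instead, or use scikit-learn's GradientBoostingRegressor(loss="quantile", alpha=0.95) as a drop-in alternative.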

Stephan Kolassa