I am modeling a heavily right-skewed dependent variable using a large number of independent variables. The dependent variable is integer-valued, but for simplicity let's assume this is our model: $Y = a_0 X_0 + a_1 X_1 + b_0$. It seems I have a few options:

- **Max-limiting the dependent variable (capping).** We can build a model where we fit $\min(Y, max_Y) = a_0 X_0 + a_1 X_1 + b_0$ instead. This might hurt our model's results for genuinely large Ys. If we don't cap, the results for small Ys will be distorted, because the fit is pulled toward the few very large values.

- **Log-transforming the dependent variable.** I have not tried this yet, but I think it is a good idea to transform the dependent variable to a healthier distribution. We can then transform predictions back to the original scale using exponentiation. What are the potential issues with this technique? Our team tried Poisson regression and the results weren't really impressive.

- **Some sort of quantile regression.** Let's say we build 4 models, one for each of 4 quantiles of $Y$, then predict Y with all of those models and choose the one whose Y_pred falls within its own quantile range. What are the issues with such a methodology? Any rigor issues?
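
To make the first two options concrete, here is a minimal sketch on synthetic data (not our real dataset). Everything in it is an assumption for illustration: the data-generating process, the hypothetical cap at the 95th percentile, the helper name `ols_fit`, and a continuous, strictly positive $Y$ so that a plain log is defined. It also applies Duan's smearing factor as one possible correction for back-transformation, which assumes roughly constant error variance on the log scale.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data (an assumption): log(Y) is linear in X0 and X1 plus noise
n = 5000
X = rng.normal(size=(n, 2))
log_y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 1.0 + rng.normal(scale=1.0, size=n)
y = np.exp(log_y)  # strictly positive and heavily right skewed

A = np.column_stack([X, np.ones(n)])  # design matrix: X0, X1, intercept b0


def ols_fit(design, target):
    """Return least-squares coefficients for target ~ design (hypothetical helper)."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


# Option 1: capping at a hypothetical max_Y (here the 95th percentile of Y)
max_y = np.quantile(y, 0.95)
y_capped = np.minimum(y, max_y)  # capping uses min(Y, max_Y)
pred_capped = A @ ols_fit(A, y_capped)

# Option 2: fit on the log scale, then transform predictions back
fitted_log = A @ ols_fit(A, np.log(y))
resid_log = np.log(y) - fitted_log

# Plain exponentiation targets the conditional median (for symmetric
# log-scale errors), so it sits below the conditional mean of a skewed Y
pred_naive = np.exp(fitted_log)

# Duan's smearing factor: average of the exponentiated log-scale residuals
smear = np.mean(np.exp(resid_log))
pred_smeared = pred_naive * smear

print("mean of y:              ", y.mean())
print("mean capped prediction: ", pred_capped.mean())
print("mean naive exp(pred):   ", pred_naive.mean())
print("mean smeared prediction:", pred_smeared.mean())
```

In this setup the capped model's average prediction equals the average of the capped target, so it is below the average of $Y$ by construction. The naively exponentiated predictions also come out low on average, and the smearing factor moves them back toward the mean. For an integer $Y$ that contains zeros, `np.log` is undefined at zero, so something like `np.log1p` with `np.expm1` would be needed instead, and the correction above would not carry over unchanged.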

I appreciate any additional thoughts and ideas, or pointers to journal/Medium articles or anything else that would help me deal with a skewed dependent variable.

P.S.: We are also modeling the same input/output using a deep model. Any ideas on that front are also appreciated.

Richard Hardy
  • 67,272
aghd
  • 314
