I have a set of features to relate to two different values. When I made a regressor for only one it worked well, but if I use two it does not?

Question

I have a set of 33 features per observation (x), which can be related to two different values (y), and I have 1203985 observations. Using np.shape(), the dimensions are x = (1203985, 33) and y = (1203985, 2).

I used a random forest regressor (100 trees), but at first with only one value of y, so y had shape (1203985, 1). It worked well, although sometimes with large errors. However, for my application I need the other value too, so I passed y with shape (1203985, 2) into the random forest regressor, and it gave unusable results. Why did it work very well for nx1 but not for nx2? The two values in nx2 are almost unrelated and have little to no effect on each other; maybe that is the mistake I am making? If I shouldn't use a random forest here, what method should I use (in scikit-learn) if I want that kind of learner? How can I analyze which algorithm I should use with respect to my data?

For the preprocessing, I only checked whether my features have low variance. I did not scale my data, as my y is in the range [0, 1]. This is my first machine learning project, so I considered this adequate.

I hope my problem is clear.

Dave · Accepted Answer · 2022-04-18T10:34:40.807

You don’t necessarily have a good predictor of the other $y$ variable, so while you might have good ability to predict $y_1$, you might lack much ability to predict $y_2$. Part of your job as a data scientist is to be honest about not being able to make accurate predictions.

Another possibility is that your software does not like the bivariate $y$ variable. Check the documentation, and see if you can predict $y_2$ on its own. If you have good ability to predict $y_1$ on its own and $y_2$ on its own, that would point to a software issue.
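One way to run that check is to fit each target on its own, fit both together, and compare the held-out $R^2$ per target. The sketch below is only an illustration on synthetic data: the arrays `X` and `Y`, the way `y1` and `y2` are generated, and the forest settings are all assumptions, not the asker's data. scikit-learn's `RandomForestRegressor` accepts a two-column `y` directly, so the joint fit needs no special handling.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 33 features, 2 targets in [0, 1].
rng = np.random.default_rng(0)
n_samples, n_features = 1500, 33
X = rng.normal(size=(n_samples, n_features))
y1 = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))  # depends on the features
y2 = rng.uniform(size=n_samples)                    # unrelated to the features
Y = np.column_stack([y1, y2])

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# One forest per target, each evaluated on held-out data.
separate_r2 = []
for j in range(Y.shape[1]):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, Y_train[:, j])
    separate_r2.append(r2_score(Y_test[:, j], model.predict(X_test)))

# One forest for both targets, with R^2 reported per column.
joint = RandomForestRegressor(n_estimators=50, random_state=0)
joint.fit(X_train, Y_train)
joint_r2 = r2_score(Y_test, joint.predict(X_test), multioutput="raw_values")

print("separate R^2 per target:", separate_r2)
print("joint R^2 per target:   ", joint_r2)
```

If the separate model for $y_2$ already scores poorly, the features probably carry little information about $y_2$. If both separate models score well and only the joint fit degrades, the joint setup is the more likely culprit.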

(If you’re satisfied with your performance predicting $y_1$ on its own and $y_2$ on its own with a separate regression, then you don’t need to do the multivariate regression model of both at the same time. While you might get even better results by doing both in one model, the particulars of the problem might make the performance improvement not worth the trouble.)
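If separate regressions turn out to be enough, scikit-learn's `MultiOutputRegressor` can keep them behind a single `fit`/`predict` interface by training one independent copy of the base estimator per target column. This is a minimal sketch on made-up data; the arrays and the choice of 50 trees are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Made-up data (assumption): 33 features, 2 targets that depend on different features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 33))
Y = np.column_stack([X[:, 0] ** 2, np.sin(X[:, 1])])

# One independent random forest is fitted per column of Y.
wrapped = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=50, random_state=0)
)
wrapped.fit(X, Y)

pred = wrapped.predict(X[:5])
print(pred.shape)                # one row per sample, one column per target
print(len(wrapped.estimators_))  # one fitted forest per target
```

Because each target gets its own forest, a weak relationship between the features and one target cannot affect the splits chosen for the other.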
