I built a model with two variables. I found that the predicted values for the test set did not align with their real values, so I predicted instead on the data that were used to train the model. I got the following results:

[image]

I don't have a lot of experience in ML, but it shouldn't look like this: the predicted values should sit around the red line.

What could be the cause? Is it that my model is insufficient in terms of the information brought by the predictors?

It is neither a programming error nor an error in the data, as I have already checked.

1 Answer

It was probably a stupid question.

Anyway, here is the answer:

I built the model again, this time with all the variables I have at hand. Only then did the predicted values sit around the diagonal line (at least, the fit is much better). So it just means that my two-variable model doesn't bring enough information.

[image]

EDIT: Adding this 3D picture to show how the response variable varies as a function of my two explanatory variables (z = response).

[image]

The plane shows how the regression is performed. Could it be that this looks like regression to the mean but is not, and is simply due to the structure of the data?

  • Good question--but wrong answer. Please see our posts on "regression to the mean." In addition to the duplicates (concerning this phenomenon with data), the figures in my theoretical account (concerning bivariate distributions) might be helpful. – whuber Aug 11 '23 at 18:28
  • Thanks for your answer, but I truly don't understand. Does that mean that my response variable changes far too much in comparison to my predictors? – Renaud Bied-charreton Aug 13 '23 at 19:35
  • I don't understand how it applies to my case. I have read many topics about regression to the mean, thanks to your link, but it seems to me that it is mainly about bias in the experimental design. Here, I just want to predict a variable with a model. The response variable and the independent variables were measured at the same time and place. – Renaud Bied-charreton Aug 13 '23 at 21:33
  • For example, this post https://stats.stackexchange.com/a/404310/386070 , as I understand it, says that if a variable is characterized by a normal distribution, then predicting it will mainly lead to predictions closer to the mean.

    The consequence would be: if the response variable has a normal distribution, it is useless to try to predict anything other than the mean. Is that it?

    – Renaud Bied-charreton Aug 13 '23 at 21:41
  • Your illustrations are nice examples of regression to the mean. No other explanation is necessary. Regression is not solely a property of bivariate Normal distributions--it is a universal phenomenon--but is most easily illustrated and explained with them. – whuber Aug 14 '23 at 13:34
  • But why does it occur specifically with my data? Would it help if I sent you a sample of the data? – Renaud Bied-charreton Aug 14 '23 at 13:57
  • Regression to the mean is a mathematical phenomenon: it occurs with all bivariate data, no matter how they might have been produced. – whuber Aug 14 '23 at 14:00
  • OK, then if I had only one variable in the model, it wouldn't occur, I guess.

    But why does this effect seem to disappear with an overfitted model? (60 variables instead of 2, for 200 individuals)

    – Renaud Bied-charreton Aug 14 '23 at 14:02
  • It doesn't disappear: it just becomes smaller, because "overfitted" and "high $R^2$" are nearly the same thing. – whuber Aug 14 '23 at 14:28
  • You will start to think I'm stubborn (or stupid), but this study talks about RTM happening in pre-post studies, which is not my case. And every discussion of RTM I find is about pre-post designs or about following the change in a value across occurrences.

    I can't understand how and why it would be RTM in my case. I'm really sorry, I'm just doing my best to understand the source of the underlying problem.

    https://pubmed.ncbi.nlm.nih.gov/30743311/#:~:text=Regression%20to%20the%20mean%20for%20the%20bivariate%20binomial,an%20inaccurate%20conclusion%20in%20a%20pre-post%20study%20design.

    – Renaud Bied-charreton Aug 14 '23 at 14:52
  • "Happening in pre-post studies" does not mean "happens only in pre-post studies!" – whuber Aug 14 '23 at 17:33
  • If the goal is good predictive performance, what would you do in my place? Would you choose another prediction method (shrinkage, variable transformation, ensemble models...)?

    Looking more and more at the 3D plots of X, Y and Predicted, I can see more and more why this regression is a natural consequence, but I guess it therefore does not predict very well. My final word would be that this result is natural, but the choice of variables is wrong.

    – Renaud Bied-charreton Aug 17 '23 at 14:24
  • Nobody can answer that because it depends on what you're predicting, on why, and on how accurate you need your predictions to be. – whuber Aug 17 '23 at 14:25
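
The point made in the comments can be checked with a small simulation. This is only a sketch under assumptions that are not from the original post: synthetic Gaussian predictors, a hypothetical response driven by the first two of them plus heavy noise, and ordinary least squares with an intercept, fitted and evaluated on the same 200 rows. For such a fit, the least-squares slope of the fitted values (vertical axis) against the observed values (horizontal axis) equals the in-sample $R^2$. The cloud is therefore flatter than the diagonal whenever $R^2 < 1$, and adding predictors, even pure noise, raises the in-sample $R^2$ and moves the cloud toward the diagonal.

```python
import numpy as np

# Assumption: synthetic data standing in for the poster's (unavailable) dataset.
rng = np.random.default_rng(0)
n_obs, n_all = 200, 60                      # 200 individuals, 60 candidate predictors
X = rng.normal(size=(n_obs, n_all))
# Hypothetical response: only the first two predictors carry signal; the rest is noise.
y = X[:, 0] + X[:, 1] + rng.normal(scale=3.0, size=n_obs)


def r2_and_slope(X_sub, y):
    """Fit OLS with an intercept on the training rows and return
    (in-sample R^2, slope of fitted values regressed on observed values)."""
    A = np.column_stack([np.ones(len(y)), X_sub])      # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)       # least-squares coefficients
    y_hat = A @ coef                                   # in-sample predictions
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    slope = np.polyfit(y, y_hat, 1)[0]                 # slope of y_hat against y
    return r2, slope


r2_small, slope_small = r2_and_slope(X[:, :2], y)      # two-variable model
r2_large, slope_large = r2_and_slope(X, y)             # all 60 variables

print(f" 2 predictors: R^2 = {r2_small:.3f}, slope = {slope_small:.3f}")
print(f"60 predictors: R^2 = {r2_large:.3f}, slope = {slope_large:.3f}")
```

In each line of output the slope matches $R^2$, and both are larger for the 60-predictor fit. That larger value reflects in-sample fit only; it says nothing about how the bigger model would perform on a held-out test set.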