
Here's my idea and my doubt:

There are two purposes of machine learning: inference and prediction. In prediction, we are interested in finding the model that gives the best accuracy when we forecast the output for a new data point. In inference, the goal is to understand the relationship between the input variables and the output variable.

However, for a particular problem we assume there is a single underlying data-generating process (DGP). In that case, wouldn't fitting different models for inference and prediction violate this assumption? For example, we can fit a linear regression for inference and a normal-distribution-based model for prediction, but (in the best possible scenario) there can be only one underlying DGP. Does it make sense to do so?

mw981

2 Answers


No. Even when we assume a single data-generating process, and even when it's a very simple one, it's quite possible for the best predictive model to be very different from the best model for estimating the effect of a specific variable.

Suppose we have a very simple data-generating process:

x ~ Bernoulli(0.5)
z <- x + N(0, 1)
y <- z + N(0, 1)

where the arrows represent truly causal relationships. Under this data-generating process, y is independent of x given z, so the best predictive model is y~z. If we added x to this model, it would get a coefficient of 0: given z, x is not predictive.

However, if we want to estimate the effect of x on y, that effect is not zero. Individuals with x == 1 have higher y than those with x == 0, by 1 unit on average, and this relationship is causal. To estimate the effect of x on y we need to fit the model y~x, not y~z or y~x+z.

If you constructed a model describing the full data-generating process, you could derive from it both the best predictive model, y~z, and the best model for the effect of x, y~x. These models are different, and they would be different whether or not you knew the exact data-generating process.
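A quick simulation makes this concrete (a minimal sketch in base R, assuming the DGP above; the seed and sample size are arbitrary choices):

set.seed(1)
n <- 1e5
x <- rbinom(n, 1, 0.5)   # x ~ Bernoulli(0.5)
z <- x + rnorm(n)        # z <- x + N(0, 1)
y <- z + rnorm(n)        # y <- z + N(0, 1)

coef(lm(y ~ x))          # x coefficient near 1: the total causal effect of x on y
coef(lm(y ~ x + z))      # x coefficient near 0: conditioning on z screens x off from y
coef(lm(y ~ z))          # the best predictive model; z coefficient near 1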

Thomas Lumley
  • So basically we treat prediction and inference independently. But then it could be that the inference model suggests one thing (say, a linear increase of 1000 per unit increase in z) while the predictive model disagrees (say, showing a quadratic trend). How do we avoid such scenarios? How can we claim the inference model is correct if the predictive model does not agree with the inference claims? – mw981 Apr 03 '23 at 18:41

First of all, I would argue that for making predictions in machine learning, in many cases you don't really need to care about the data-generating process. If you are using something like $k$NN regression, the only thing you're doing is predicting the mean of the most similar data points in the training data; it makes no assumptions about the data-generating process. The same applies to most other machine learning algorithms.
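To make that concrete, here is a bare-bones sketch of one-dimensional $k$NN regression in base R (the helper name knn_predict is made up for illustration). Every prediction is just the average outcome of the k nearest training points; no distribution appears anywhere:

# Predict y at each new point as the mean y of its k nearest training points
knn_predict <- function(x_train, y_train, x_new, k = 5) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[1:k]  # indices of the k closest training x's
    mean(y_train[nearest])                    # average their outcomes; no DGP assumed
  })
}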

Moreover, keep in mind that the idea of a data-generating process is just an abstraction that helps us formalize the statistical model for the particular data. There is no "the" data-generating process. No data was "generated" from something like a Gaussian (or any other) distribution, because there is no such thing in nature as a Gaussian distribution; it's a mathematical concept.

For the purpose of inference, we want our model to be simple so that it is easy to interpret. For making predictions, we want it to be accurate, possibly at the price of being less interpretable. Neither model is the single correct one; both are wrong, but each is useful in its own way.

Tim
  • Thanks for the very informative answer. The second paragraph drives the point almost home. However, I am still having difficulty coming to terms with both models being fitted independently. I would expect that if the inference model says y should vary linearly, the predictive model should show that trend. On the other hand, if the predictive model suggests the variation is quadratic, our inference model should also include quadratic terms; a linear model would not make sense. So there is a dependency between the two choices in that sense. Is my reasoning correct? – mw981 Apr 03 '23 at 18:48
  • @mw981 not sure what you mean. It's definitely possible to fit two different models to the same data. – Tim Apr 03 '23 at 19:19
  • Agreed, it's possible. What I mean is: suppose we find that the best predictive model is a linear model with a quadratic term; in that case, if we derive inference insights from a purely linear model, will they be correct? Shouldn't I use a quadratic structure for the inference model as well? – mw981 Apr 03 '23 at 19:26
  • @mw981 how do you define “best”? – Tim Apr 03 '23 at 19:35
  • by best I mean the model with the least error on the test set. – mw981 Apr 03 '23 at 19:47
  • @mw981 that's not a definition that is of any relevance for an inference model, since you are not going to use it to make out-of-sample predictions. – Tim Apr 03 '23 at 19:52
  • I meant that for the predictive model. For the best inference model, my least expectation would be to be consistent with trends that are shown by the predictive model. Suppose we are regressing the number of friends against salary and the best predictive model suggests that the increase in friends with salary is quadratic. Now, suppose we fit linear regression and make a statement that with a unit increase in salary, x is the expected linear increase in the number of friends. But the predictive model suggests a quadratic increase. Shouldn't we fit a quadratic model to infer properly here? – mw981 Apr 03 '23 at 20:16
  • @mw981 it doesn't "prove" that such a model would be best for inference. A Formula 1 car may be the fastest, but if you would like to go off-road, it would hardly be the best or fastest choice. – Tim Apr 03 '23 at 20:23