5

I fitted two regression models, ordinary least squares (OLS) and linear absolute regression (least absolute deviations), to the same dataset, which was split into training and test sets.

I use two performance measures to check the accuracy of the regression models:

- MSE: mean squared error
- MAD: mean absolute deviation

I found that the model fitted by OLS has a lower MSE on unseen data than the one fitted by linear absolute regression.

On the other hand, the model fitted by linear absolute regression has a lower MAD on unseen data than the one fitted by OLS.

Therefore, if I use MSE as the performance measure, I end up saying that the OLS model is best, and, contradictorily, if I use MAD as the performance measure, I say that the linear absolute regression model is best.
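For concreteness, here is a minimal sketch of the kind of comparison I mean (simulated data with skewed errors; the absolute-loss model is fitted with the quantreg package as one possible implementation of linear absolute regression, so the exact numbers are only illustrative):

```r
library(quantreg)  # rq() fits median (least-absolute-deviations) regression

set.seed(1)
n <- 2000
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 1 + 2 * dat$x + (rexp(n) - 1)        # skewed errors: conditional mean != conditional median
idx   <- sample(n, n / 2)
train <- dat[idx, ]
test  <- dat[-idx, ]

fit_ols <- lm(y ~ x, data = train)            # minimizes squared loss in-sample
fit_lad <- rq(y ~ x, tau = 0.5, data = train) # minimizes absolute loss in-sample

score <- function(fit) {
  e <- test$y - predict(fit, newdata = test)
  c(MSE = mean(e^2), MAD = mean(abs(e)))
}
rbind(OLS = score(fit_ols), LAD = score(fit_lad))
```

On data like these the pattern described above typically shows up: the OLS fit scores better on test MSE while the absolute-loss fit scores better on test MAD, although this is not guaranteed for every split.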

My colleagues claim that a performance metric will always prefer the regression model that minimizes the same type of loss. In other words, in a comparison study I cannot use MAD alone and conclude that linear absolute regression is the best choice, because if I report MSE, the OLS model will look better.

In my recent question, Dave's answer showed counterexamples on both real and simulated datasets.

My question is as follows:

If their claim is not correct, how would you phrase a counter-argument in a few sentences and in a logical manner, not just R code? If they are correct, why and how does this happen?

jeza
  • Since “bias” has a technical meaning in statistics, it might be helpful to explain that with a different word (e.g., “preference”). – Dave Apr 02 '22 at 18:08
  • That's not bias. It's using a scoring function that is consistent for a given functional. If you're not willing to predict whole distributions, then there is no theoretical way to derive a "universal" measure: your choice of a scoring function is a substantive decision that expresses an implicit interest in predicting a specific functional of the distribution, and using inconsistent scoring functions simply leads to improper scoring. – Chris Haug Apr 02 '22 at 19:09
  • Related: https://stats.stackexchange.com/questions/470626 – Richard Hardy Apr 03 '22 at 07:25
  • Feels like a nice question, but it could use a better title. – Nuclear241 Apr 03 '22 at 07:50
  • This is a pretty major rewrite of the original question, abandoning the issue of "bias" altogether... also, @Dave's answer to the previous question now directly answers this question, and not just with R code; read it in its entirety and you will see that. – jbowman Apr 03 '22 at 15:39
  • Is there something lacking in the existing answers? I'd like to address any remaining concerns you have and close out this question. – Dave Apr 08 '22 at 10:39

2 Answers

6

Dave's answer has nothing to do with whether there is bias in an in-sample metric versus an out-of-sample metric when the algorithm optimizes the in-sample metric. His answer addresses whether minimizing the in-sample metric necessarily also minimizes the (expected) out-of-sample metric (Edit: it doesn't); it says nothing about the relative values of the two. The bias issue is that if you do minimize an in-sample metric, the corresponding out-of-sample metric can be expected to be worse; it says nothing about whether some other objective function could improve the out-of-sample metric.
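To make the bias point concrete, here is a small simulation sketch (simulated data, purely illustrative and not taken from Dave's linked answer): an OLS fit's in-sample MSE is systematically lower than its MSE on fresh data drawn from the same process.

```r
set.seed(2)
optimism <- replicate(2000, {
  n <- 30
  x <- rnorm(n);     y <- 1 + 2 * x + rnorm(n)             # training data
  x_new <- rnorm(n); y_new <- 1 + 2 * x_new + rnorm(n)     # fresh data, same process
  fit <- lm(y ~ x)                                         # minimizes in-sample MSE
  mse_in  <- mean(resid(fit)^2)
  mse_out <- mean((y_new - predict(fit, data.frame(x = x_new)))^2)
  mse_out - mse_in
})
mean(optimism)  # positive on average: the optimized in-sample metric is optimistic
```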

jbowman
  • I'd argue that the whole point of my answer is that training with a different loss function than we use out-of-sample might result in better out-of-sample performance. – Dave Apr 02 '22 at 18:01
  • @Dave - agreed, but it has nothing to do with the bias issue. Added an edit to include your point. – jbowman Apr 02 '22 at 18:03
  • @jbowman, the idea of my main question is simpler. Is it correct that, on the same dataset, if I use MSE as the performance measure I will end up saying that the OLS model is best, and, contradictorily, if I use MAD I will say that linear absolute regression is best? Either way, why does this happen and how? – jeza Apr 03 '22 at 02:20
  • @jeza you can create any dataset for which the out-of-sample metric prefers any method of your choice, because it is out-of-sample and doesn't have to follow the same distribution as the in-sample dataset. Perhaps the effect you're looking for assumes that the out-of-sample data follow essentially the same distribution as the in-sample data. But then it's obvious that if the metric is the same as the optimized metric, it will be the best, since it's essentially the same data. – justhalf Apr 03 '22 at 10:35
  • @justhalf, I do not think you are right. Dave's linked answer elaborates why. There is no need to assume different distributions in-sample vs. out-of-sample to get "surprising" or "conflicting" results like the ones discussed in this thread. – Richard Hardy Apr 06 '22 at 06:19
  • @RichardHardy I guess it was indeed my mistake to claim that "it will be the best", since it gives the impression that it will guarantee minimum error for out-of-sample data. A possible reason why the model can lose is overfitting. But what I meant is that, in general, if you're trying to optimize out-of-sample metric A, then you optimize metric A when training as well. Of course it's possible, given statistical noise, for some other model with worse training loss to outperform it during testing, but there is no theoretical basis to do so. But I can concede that empirically one can do that with validation data. – justhalf Apr 06 '22 at 07:30
  • @justhalf, you are not getting the essence of it. The optimal training loss function does not generally coincide with the evaluation loss function, even under ideal conditions where the training sample comes from exactly the same data-generating process as the evaluation sample. This is a fascinating fact; it is worth trying to understand it. A good starting point is estimating the true median (which is the minimizer of the expected absolute loss in the evaluation sample) by the sample mean (which is the minimizer of square loss in the training sample) – NOT the median – when the DGP is normal; a short simulation of this point is sketched after these comments. – Richard Hardy Apr 06 '22 at 07:45
  • Oh, I was assuming that the question here is about the case where the eval loss is the same as the training loss. My bad then. Apologies! – justhalf Apr 06 '22 at 09:04
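A short simulation sketch of Richard Hardy's last point (normal data, purely illustrative): the sample mean, which minimizes squared loss in the training sample, typically achieves lower out-of-sample absolute loss than the sample median, even though the target of absolute loss is the median.

```r
set.seed(3)
res <- replicate(5000, {
  train <- rnorm(25)    # small training sample from N(0, 1); true median = 0
  test  <- rnorm(1000)  # fresh data from the same distribution
  c(mean_fit   = mean(abs(test - mean(train))),    # "trained" with squared loss
    median_fit = mean(abs(test - median(train))))  # "trained" with absolute loss
})
rowMeans(res)  # the squared-loss estimate typically wins on out-of-sample absolute loss
```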
0

Since the out-of-sample data are different from the in-sample data, all bets are off when it comes to what the out-of-sample metric chooses as its preferred model. In some sense, we are tuning the in-sample loss function as a hyperparameter in order to achieve the best out-of-sample performance on our metric of choice. If the out-of-sample metric prefers a model trained with a different loss function, so be it! That’s why we tune the hyperparameter.

I would present the argument like that. I also would be comfortable giving simulations or empirical data where, for example, out-of-sample square loss prefers a model that was trained with absolute loss over one trained with square loss (such as the examples I gave with Iris and simulations).
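As one illustrative sketch of such a simulation (contaminated training data, not the Iris example itself): when the training set contains a few gross outliers, out-of-sample squared error can prefer the fit trained with absolute loss.

```r
library(quantreg)  # rq() fits median (least-absolute-deviations) regression

set.seed(4)
n <- 200
x_tr <- runif(n, 0, 10)
y_tr <- 1 + 2 * x_tr + rnorm(n)
y_tr[1:10] <- y_tr[1:10] + 50          # gross outliers in the training data only
x_te <- runif(n, 0, 10)
y_te <- 1 + 2 * x_te + rnorm(n)        # clean test data

fit_ols <- lm(y_tr ~ x_tr)             # trained with squared loss
fit_lad <- rq(y_tr ~ x_tr, tau = 0.5)  # trained with absolute loss

test_mse <- function(fit) mean((y_te - predict(fit, data.frame(x_tr = x_te)))^2)
c(OLS = test_mse(fit_ols), LAD = test_mse(fit_lad))
# out-of-sample *squared* loss typically prefers the absolute-loss fit here
```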

Dave