
I often see blog posts or YouTube videos saying that MSE should be used for regression problems, especially when dealing with time series.

For example, on this site the author says:

The Mean Squared Error, or MSE, loss is the default loss to use for regression problems. Mathematically, it is the preferred loss function under the inference framework of maximum likelihood if the distribution of the target variable is Gaussian. It is the loss function to be evaluated first and only changed if you have a good reason.

OK, but why? And what would be a good reason for changing it? Unfortunately, the author does not cite any references for this statement.

I am looking for papers where the authors explain or compare cost functions for time series regression problems, but such papers seem hard to find.
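(To spell out the Gaussian part of the quote for myself: if the targets are assumed to follow $y_i \sim \mathcal{N}(\mu_i, \sigma^2)$ with fixed $\sigma^2$, the negative log-likelihood is

$$-\log L(\mu) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu_i)^2,$$

so maximizing the likelihood picks out exactly the same parameters as minimizing the MSE. What I cannot find is a principled account of when to deviate from this default.)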

Rods2292
  • It is well behaved in many senses, and is closely associated with the mean and variance. One good reason for changing it could be that you know what your loss function actually is. – Henry Sep 04 '22 at 02:26

1 Answer


Your loss function should be governed by what functional of the unknown distribution of the future observables you want to elicit (whether you explicitly consider that unknown distribution or not). If you want a conditional mean forecast, you should use the MSE. If you want a conditional median, you should use the MAE. If you want a conditional quantile, you should use an appropriate pinball loss. If for some strange reason you want the (-1)-median, you should use the MAPE (and make sure it is defined). I have written a little paper on this with more arguments and references: Kolassa (2020, IJF).
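As a sketch of that correspondence (my own illustration, not code from the paper; the lognormal data and the 90% quantile level are arbitrary choices): minimizing each loss over a single constant forecast recovers exactly the functional named above.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(42)
    y = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # skewed targets, so mean != median

    def mse(c):                 # squared error elicits the mean
        return np.mean((y - c) ** 2)

    def mae(c):                 # absolute error elicits the median
        return np.mean(np.abs(y - c))

    def pinball(c, q=0.9):      # pinball loss elicits the q-quantile
        e = y - c
        return np.mean(np.maximum(q * e, (q - 1) * e))

    for loss, target in [(mse, y.mean()),
                         (mae, np.median(y)),
                         (pinball, np.quantile(y, 0.9))]:
        c_star = minimize_scalar(loss, bounds=(0.0, 20.0), method="bounded").x
        print(f"{loss.__name__:8s} argmin = {c_star:.3f}, target functional = {target:.3f}")

The three argmins land on the sample mean, median, and 90% quantile respectively, which is the point: the loss, not the model, decides which functional of the predictive distribution you are forecasting.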

Stephan Kolassa
  • Thank you. Another question: why not use RMSE as a cost function instead of MSE? RMSE is always used at the end, when evaluating the predictions, but I never see people talk about using it as a cost function – Rods2292 Sep 04 '22 at 15:24
  • Because it does not matter. When we optimize some measure of fit, we don't care about the value of the objective function, only about the argument that optimizes it (i.e., the parameter estimates or NN parameters). And since the square root is a monotone function, the MSE is minimized by the exact same parameter settings as the RMSE. The only difference is that the RMSE needs an additional function evaluation (and the square root is computationally expensive). – Stephan Kolassa Sep 04 '22 at 16:11
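To make the monotonicity argument concrete, here is a quick numerical check (my own sketch with simulated data, not from the thread): fitting the same linear model by minimizing MSE and by minimizing RMSE returns the same coefficients up to optimizer tolerance.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = X @ np.array([1.5, -0.7]) + rng.normal(scale=0.5, size=200)

    def mse(beta):
        return np.mean((y - X @ beta) ** 2)

    def rmse(beta):
        return np.sqrt(mse(beta))   # a monotone transform of the MSE

    beta_mse  = minimize(mse,  x0=np.zeros(2)).x
    beta_rmse = minimize(rmse, x0=np.zeros(2)).x
    print(beta_mse, beta_rmse)      # identical up to numerical tolerance

Any strictly increasing transform of the objective (square root, log, positive scaling) leaves the argmin unchanged, so the choice between MSE and RMSE matters for reporting, not for fitting.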