You say that you don’t want to use a metric that is going to prefer a particular model, so out-of-sample (R)MSE is out because it will prefer the model that was trained with square loss. Au contraire! Let’s do a simulation and show a model trained by minimizing square loss having greater out-of-sample square loss than a model trained by minimizing absolute loss.
library(quantreg)
library(MASS)
set.seed(2022)
N <- 100
# Define correlated predictors
X <- MASS::mvrnorm(N, c(0, 0), matrix(c(1, 0.9, 0.9, 1), 2, 2))
# Define response variable
ye <- 3 - X[, 1] + 2*X[, 2]
e <- rt(N, 1.1) # error term, t-distributed with heavy tails for outliers
y <- ye + e
# Allocate the first 20 observations to testing and the rest to training
X_test <- X[1:20, ]
y_test <- y[1:20]
X_train <- X[21:N, ]
y_train <- y[21:N]
# Define an OLS linear model
x1 <- X_train[, 1]
x2 <- X_train[, 2]
L1 <- lm(y_train ~ x1 + x2)
# Define a linear model trained using MAE
L2 <- quantreg::rq(y_train ~ x1 + x2, tau = 0.5)
# Make predictions for the test set using each of the two models
p1 <- predict(L1, data.frame(x1 = X_test[, 1], x2 = X_test[, 2]))
p2 <- predict(L2, data.frame(x1 = X_test[, 1], x2 = X_test[, 2]))
# Calculate the MSE for both sets of predictions
mse1 <- mean((y_test - p1)^2)
mse2 <- mean((y_test - p2)^2)
print(paste(
"OLS has MSE = ", mse1 # I get ~10.1
))
print(paste(
"Median quantile regression has MSE = ", mse2 # I get ~7.7, so lower
# than the OLS model gave
))
In this setup, which uses correlated features and outliers (from the t-distributed error term), the OLS regression has worse out-of-sample MSE than the MAE-trained regression, showing that out-of-sample MSE need not prefer the model that minimized in-sample MSE.
Consequently, if you believe out-of-sample square loss to be the metric of interest, go with that. Since that choice does not assure you of ending up with a model that was trained by minimizing square loss, it is worth your time to investigate models trained with other loss functions, too.
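If you want a third candidate for that investigation, here is a minimal sketch (reusing x1, x2, y_train, X_test, and y_test from the code above) that adds a regression trained with Huber loss via MASS::rlm; which of the three wins on out-of-sample MSE depends on the data, which is rather the point.
# A third candidate: a regression trained with Huber loss (MASS::rlm)
L3 <- MASS::rlm(y_train ~ x1 + x2)
p3 <- predict(L3, data.frame(x1 = X_test[, 1], x2 = X_test[, 2]))
mse3 <- mean((y_test - p3)^2)
print(paste("Huber-trained regression has MSE =", mse3))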
It might be surprising for the model trained on the out-of-sample metric to lose to a model trained on a different metric, but it should not be. This is what happens when we apply, for instance, ridge regression. In ridge regression, we minimize a loss function that is slightly different from square loss, hoping that this alternative gives us better out-of-sample performance on regular square loss than the model trained on regular square loss.
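To make that concrete on the same simulated split, here is a hedged sketch that assumes the glmnet package (not used elsewhere in this answer): fit a ridge model whose penalty weight is chosen by cross-validated MSE and score it on the same test set.
library(glmnet)
# Ridge regression: alpha = 0 selects the ridge penalty; cv.glmnet chooses
# the penalty weight lambda by cross-validated MSE
ridge_cv <- glmnet::cv.glmnet(X_train, y_train, alpha = 0)
p_ridge <- predict(ridge_cv, newx = X_test, s = "lambda.min")
mse_ridge <- mean((y_test - p_ridge)^2)
print(paste("Ridge regression has out-of-sample MSE =", mse_ridge))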
For an out-of-sample evaluation metric (perhaps even for in-sample training), you may be interested in relative error metrics. "Mean absolute percent deviation" (also called mean absolute percentage error, MAPE) is the easiest to understand. The Wikipedia page discusses the shortcomings of this metric and potential alternatives. Our Stephan Kolassa has a nice discussion of this topic, too.
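For completeness, here is a tiny sketch of that metric applied to the test-set predictions p1 and p2 from the simulation above; mape here is my own helper, not a package function, and it misbehaves whenever an actual value is at or near zero, which is one of the shortcomings mentioned above.
# Mean absolute percentage error (MAPE): relative rather than absolute errors
mape <- function(actual, predicted) {
  100 * mean(abs((actual - predicted) / actual))
}
print(paste("OLS MAPE =", mape(y_test, p1)))
print(paste("Median quantile regression MAPE =", mape(y_test, p2)))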
You also mentioned wanting to evaluate the accuracy of coefficient estimates. We can investigate this with another simulation, which will show the MSE-trained model (OLS) to have greater MSE when it comes to coefficient estimates than the MAE-trained model.
library(quantreg)
library(MASS)
library(ggplot2)
set.seed(2022)
N <- 100
R <- 1000
b0 <- 3
b1 <- -1
b2 <- 2
M1 <- M2 <- matrix(NA, R, 3) # Matrices for holding estimated coefficients
for (i in 1:R){
  # Define correlated predictors
  X <- MASS::mvrnorm(N, c(0, 0), matrix(c(1, 0.9, 0.9, 1), 2, 2))
  # Define response variable
  ye <- b0 + b1*X[, 1] + b2*X[, 2]
  e <- rt(N, 1.1) # error term, t-distributed with heavy tails for outliers
  y <- ye + e
  # Define OLS and MAE regression models
  L1 <- lm(y ~ X[, 1] + X[, 2])
  L2 <- quantreg::rq(y ~ X[, 1] + X[, 2], tau = 0.5)
  # Save the coefficients to each model's respective matrix
  M1[i, ] <- summary(L1)$coef[, 1]
  M2[i, ] <- summary(L2)$coef[, 1]
  print(i)
}
# Evaluate all six coefficient MSE values
ols_0 <- mean((b0 - M1[, 1])^2)
ols_1 <- mean((b1 - M1[, 2])^2)
ols_2 <- mean((b2 - M1[, 3])^2)
mae_0 <- mean((b0 - M2[, 1])^2)
mae_1 <- mean((b1 - M2[, 2])^2)
mae_2 <- mean((b2 - M2[, 3])^2)
print(paste(
"OLS has intercept MSE of ", ols_0 # I get ~51.2
))
print(paste(
"Quantile regression has intercept MSE of", mae_0 # I get ~0.03
))
print(paste(
"OLS has X1 MSE of ", ols_1 # I get ~203.1
))
print(paste(
"Quantile regression has X1 MSE of", mae_1 # I get ~0.14
))
print(paste(
"OLS has X2 MSE of ", ols_2 # I get ~188.2
))
print(paste(
"Quantile regression has X2 MSE of", mae_2 # I get ~0.15
))
For all three parameters, the OLS regression has a much higher parameter estimate MSE than the MAE-optimizing regression (quantile regression at the median).
No matter what you do, give it context. As I discuss here, there is no universal metric that lets you "grade" a model like we all got or still get (or assign, for those readers who teach) grades in school.
EDIT
(I liked what I wrote in the comments, so I’m adding it to my answer.)
As you can see in my answer, the choice of training loss function does not guarantee a particular result when it comes to out-of-sample performance. Therefore, if you have a reason to be interested in a particular type of out-of-sample performance, go with the model that does the best on that metric. If that model happens to be the one that uses the out-of-sample metric as its training loss function, so be it, but you hardly assure yourself of a particular model winning by deciding that out-of-sample MSE is the metric of interest. Perhaps think of the training loss as a hyperparameter.
In fact, when you tune a regularization hyperparameter to achieve the best out-of-sample performance, that is exactly what you are doing: treating the training loss function as a hyperparameter you tune in order to achieve your goal of excellent out-of-sample performance, even at the expense of in-sample performance. I find it completely reasonable that this extends to markedly different loss functions like MSE vs MAE (as opposed to MSE vs MSE with the ridge regression penalty added), and my simulation shows that it can work the way I suspect it can.
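Here is a minimal sketch of that idea, reusing x1, x2, and y_train from the first simulation: treat the training loss itself as a hyperparameter and choose between square loss and absolute loss by K-fold cross-validated MSE within the training data.
# Treat the training loss as a hyperparameter chosen by cross-validation
K <- 5
folds <- sample(rep(1:K, length.out = length(y_train)))
cv_mse <- matrix(NA, K, 2) # column 1: square-loss fits, column 2: absolute-loss fits
for (k in 1:K) {
  tr <- folds != k
  d_tr <- data.frame(y = y_train[tr], x1 = x1[tr], x2 = x2[tr])
  d_te <- data.frame(x1 = x1[!tr], x2 = x2[!tr])
  f_ols <- lm(y ~ x1 + x2, data = d_tr)
  f_mae <- quantreg::rq(y ~ x1 + x2, tau = 0.5, data = d_tr)
  cv_mse[k, 1] <- mean((y_train[!tr] - predict(f_ols, d_te))^2)
  cv_mse[k, 2] <- mean((y_train[!tr] - predict(f_mae, d_te))^2)
}
colMeans(cv_mse) # pick whichever training loss has the lower cross-validated MSE
Whichever column comes out lower is the training loss you would carry forward, exactly as you would carry forward the winning value of a ridge penalty.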
What your results show is that, when you pick MSE as the out-of-sample metric of interest, the best-performing model is the one that was trained with in-sample MSE. That’s fine. You didn’t guarantee that result by picking MSE as the out-of-sample metric; it just worked out that way.
EDIT 2
In our chat, you have expressed dissatisfaction with only showing my counterexample through a simulation, rather than with real data. However, it can happen with real data, too.
library(quantreg)
library(ModelMetrics)
set.seed(2022)
data(iris)
N <- dim(iris)[1]
# Shuffle the rows, then use the first 20 as the test set and the rest for training
idx <- sample(seq(1, N, 1), N, replace = FALSE)
XY <- iris[idx, ]
x_test <- XY$Petal.Length[1:20]
y_test <- XY$Petal.Width[1:20]
x_train <- XY$Petal.Length[21:N]
y_train <- XY$Petal.Width[21:N]
# Fit a square-loss (OLS) model and an absolute-loss (median quantile) model
L_ols <- lm(y_train ~ x_train)
L_mae <- quantreg::rq(y_train ~ x_train, tau = 0.5)
# Predict the test set with each model and compare out-of-sample MSE;
# a positive difference means the OLS model did worse
pred_ols <- predict(L_ols, data.frame(x_train = x_test))
pred_mae <- predict(L_mae, data.frame(x_train = x_test))
ModelMetrics::mse(y_test, pred_ols) - ModelMetrics::mse(y_test, pred_mae)
I get that the OLS-based out-of-sample MSE is $0.00149052$ higher than the MAE-trained model's MSE, showing that, even on real data, training on square loss does not guarantee a winning model when it comes to out-of-sample square loss. Therefore, when your colleagues argue that training on square loss guarantees a winner on out-of-sample square loss, they are wrong.
EDIT 3
If you run the same kind of regression but with Sepal instead of Petal, you get that the model trained with OLS outperforms (by $0.005632787$) the model trained on MAE when it comes to out-of-sample MAE!
library(quantreg)
library(ModelMetrics)
set.seed(2022)
data(iris)
N <- dim(iris)[1]
idx <- sample(seq(1, N, 1), N, replace = FALSE)
XY <- iris[idx, ]
x_test <- XY$Sepal.Length[1:20]
y_test <- XY$Sepal.Width[1:20]
x_train <- XY$Sepal.Length[21:N]
y_train <- XY$Sepal.Width[21:N]
L_ols <- lm(y_train ~ x_train)
L_mae <- quantreg::rq(y_train ~ x_train, tau = 0.5)
# Predict the test set with each model and compare out-of-sample MAE;
# a negative difference means the OLS model did better
pred_ols <- predict(L_ols, data.frame(x_train = x_test))
pred_mae <- predict(L_mae, data.frame(x_train = x_test))
ModelMetrics::mae(y_test, pred_ols) - ModelMetrics::mae(y_test, pred_mae)
Now you have examples, using real (not simulated) data, of an MSE-trained model outperforming an MAE-trained model on out-of-sample MAE and of an MAE-trained model outperforming an MSE-trained model on out-of-sample MSE.