
I have a dataset that contains both outliers and multicollinearity.

I applied three different regression models to that dataset: ordinary least squares (OLS), absolute linear regression (least absolute deviations), and Huber regression.

My goal is to test which linear regression model deals most accurately with the combined outlier and multicollinearity issues in a dataset.

I cannot use mean squared error (MSE) or $R^2$, because OLS regression will always come out best on those; likewise, with median absolute deviation (MAD), absolute linear regression will always be the best.

Which measures can I use to judge which model is more accurate than the others?

Clarification 

The dataset has been split into train and test sets; each regression model is fitted on the training set, and the performance measure is calculated on the unseen test set.

I want to check how accurately each model estimates or predicts the coefficients.

In other words, do the outliers and multicollinearity in a dataset affect the linear regression model and thus shift the coefficients too much? 

I want a universal measure by which I can judge and say, with confidence, that this linear regression model is good compared to the others, even in the presence of outliers and multicollinearity simultaneously.
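For the simulated datasets in the project, where the true coefficients are known, one option I can at least fall back on is scoring coefficient accuracy directly. A minimal sketch (beta_true and fit are placeholders for a simulation's true coefficients and any fitted candidate model):

beta_true <- c(3, -1, 2)                     # placeholder: coefficients used to generate the data
beta_hat  <- coef(fit)                       # estimates from one of the candidate models
coef_mse  <- mean((beta_hat - beta_true)^2)  # mean squared error of the coefficient estimates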

UPDATE

My question, in simple words, is this: I have a dataset that contains outliers and multicollinearity, and I applied three different regression models to it. I want to investigate the accuracy of these regression models using a performance measure that is not biased towards any particular regression model, so that after calculating this measure I can say with confidence that a given regression model is the best one even when the dataset contains outliers and multicollinearity.

NOTE

I say one dataset solely for the sake of simplification. The project actually includes over 30 datasets, a hundred simulations, and more than ten regression models.

New update after Dave's answer

When I analyse this real dataset, the situation is quite different from what Dave described: when MSE is used as the performance metric, OLS is the optimal choice; when MAD is used, absolute linear regression is the superior choice.

rm(list=ls())
library(L1pack) 
library(glmnet)
library(MASS)
library(robust)
library(robustbase)
library(quantreg)
library(readr)
library(readxl)
#######################################
## Mean of the k-fold results (column means; equivalent to colMeans)
mee <- function(x) {
  mmm <- rep(0, ncol(x))
  for (i in 1:ncol(x)) {
    mmm[i] <- mean(x[, i])
  }
  return(mmm)
}
###############  Dataset  ################
wdbc <- read_excel("Folds5x2_pp.xlsx")
###############################
x <- as.matrix(wdbc[, -5])
y <- as.numeric(wdbc[[5]])  # wdbc[, 5] is a tibble, not a vector; [[5]] extracts the column
wdbc <- as.data.frame(cbind(x, y))
n <- nrow(wdbc)
################################
# K-fold cross-validation
k <- 30  # number of folds

folds <- cut(seq(1, n), breaks = k, labels = FALSE)
mols1 <- matrix(0, nrow = k)  # OLS fold results
mM1   <- matrix(0, nrow = k)  # Huber fold results
mMM1  <- matrix(0, nrow = k)  # absolute linear regression fold results

## Split the data into train and test sets

for (i in 1:k) {
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData  <- wdbc[testIndexes, ]
  trainData <- wdbc[-testIndexes, ]
  xtr <- as.matrix(trainData[, -5])
  ytr <- trainData[, 5]
  xte <- as.matrix(testData[, -5])
  yte <- testData[, 5]

  mest  <- rlm(ytr ~ xtr, psi = psi.huber, maxit = 300)$coefficients  # Huber regression
  mmest <- rq(ytr ~ xtr, tau = 0.5)$coefficients                      # absolute linear regression
  ols   <- lm(ytr ~ xtr)$coefficients                                 # OLS

  ## MSE measure
  mols1[i] <- mean((yte - cbind(1, xte) %*% ols)^2)
  mM1[i]   <- mean((yte - cbind(1, xte) %*% mest)^2)
  mMM1[i]  <- mean((yte - cbind(1, xte) %*% mmest)^2)

  ## Use these instead for the MAD measure (uncomment and comment out the
  ## MSE block above, since these would otherwise overwrite its results):
  # mols1[i] <- mean(abs(yte - cbind(1, xte) %*% ols))
  # mM1[i]   <- mean(abs(yte - cbind(1, xte) %*% mest))
  # mMM1[i]  <- mean(abs(yte - cbind(1, xte) %*% mmest))

}
res2 <- cbind(mols1, mM1, mMM1)
MEE  <- mee(res2)
nam  <- c("OLS", "Huber", "Absolute linear regression")
ty   <- data.frame(nam, MEE, rank(MEE))
View(ty)

jeza
  • Your meaning of "accurate" is obscure, because it seems to combine (1) some unstated target of estimation or prediction (coefficients? fitted values? something else?) with (2) otherwise unrelated properties like "outliers" and "multicollinearity issues." Could you clarify what you mean? – whuber Mar 22 '22 at 23:03
  • The model trained using MSE might not have the best out-of-sample MSE. Ditto for other metrics like MAE. – Dave Mar 23 '22 at 09:01
  • @Dave, MAE also cannot be used because it will lead absolute linear regression to win the comparisons. – jeza Mar 23 '22 at 09:07
  • You say that, yet the model trained using MAE might not have the best out-of-sample MAE. In any event, you're looking at coefficient accuracy, which absolutely does not have square loss always give the best coefficients. As discussed in a question of mine, MAE can give unbiased coefficient estimates that have lower variance than the coefficients given by square loss, and out-of-sample evaluation does not come up in this context. Thus, your question becomes a matter of determining what makes sense as a notion of "accurate" for your problem. – Dave Mar 23 '22 at 09:16

2 Answers


You say that you don’t want to use a metric that is going to prefer a particular model, so out-of-sample (R)MSE is out because it will prefer the model that was trained with square loss. Au contraire! Let’s do a simulation and show a model trained by minimizing square loss having greater out-of-sample square loss than a model trained by minimizing absolute loss.

library(quantreg)
library(MASS)
set.seed(2022)
N <- 100

# Define correlated predictors
X <- MASS::mvrnorm(N, c(0, 0), matrix(c(1, 0.9, 0.9, 1), 2, 2))

# Define response variable
ye <- 3 - X[, 1] + 2*X[, 2]
e <- rt(N, 1.1)  # error term, t-distributed with heavy tails for outliers
y <- ye + e

# Allocate the first 20 observations to testing and the rest to training
X_test <- X[1:20, ]
y_test <- y[1:20]

X_train <- X[21:N, ]
y_train <- y[21:N]

# Define an OLS linear model
x1 <- X_train[, 1]
x2 <- X_train[, 2]
L1 <- lm(y_train ~ x1 + x2)

# Define a linear model trained using MAE
L2 <- quantreg::rq(y_train ~ x1 + x2, tau = 0.5)

# Make predictions for the test set using each of the two models
p1 <- predict(L1, data.frame(x1 = X_test[, 1], x2 = X_test[, 2]))
p2 <- predict(L2, data.frame(x1 = X_test[, 1], x2 = X_test[, 2]))

# Calculate the MSE for both sets of predictions
mse1 <- mean((y_test - p1)^2)
mse2 <- mean((y_test - p2)^2)

print(paste(
  "OLS has MSE = ",
  mse1  # I get ~10.1
))

print(paste(
  "Median quantile regression has MSE = ",
  mse2  # I get ~7.7, so lower than the OLS model gave
))

In this setup, which uses correlated features and outliers (from the t-distributed error term), the OLS regression has worse out-of-sample MSE than the MAE-trained regression, showing that out-of-sample MSE need not prefer the model that minimized in-sample MSE.

Consequently, if you believe out-of-sample square loss to be the metric of interest, go with that. Since the model trained by minimizing square loss is not assured of winning on it, it is worth your time to investigate the other models, too.

It might be surprising for the model trained on the out-of-sample metric to lose to a model trained on a different metric, but it should not be. This is what happens when we apply, for instance, ridge regression. In ridge regression, we minimize a loss function that is slightly different from square loss, hoping that this alternative gives us better out-of-sample performance on regular square loss than the model trained on regular square loss.
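Concretely, the ridge objective replaces the plain square loss $\sum_{i=1}^n (y_i - x_i^\top \beta)^2$ with the penalized version $\sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \sum_j \beta_j^2$: for any $\lambda > 0$ we are deliberately training on something other than square loss, in the hope of better out-of-sample square loss.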

For an out-of-sample evaluation metric (perhaps even for in-sample training), you may be interested in relative error metrics. "Mean absolute percent deviation" is the easiest to understand. The Wikipedia page discusses shortcomings of this metric and potential alternatives. Our Stephan Kolassa has a nice discussion about this topic, too.
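As a quick sketch of that easiest case, the mean absolute percent deviation for a vector of predictions yhat (a placeholder for either model's predictions) against y_test is

mape <- 100 * mean(abs((y_test - yhat) / y_test))  # in percent; undefined whenever y_test contains zeros

and the zero-denominator issue is one reason the caveats on that Wikipedia page are worth reading.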

You also mentioned wanting to evaluate the accuracy of coefficient estimates. We can investigate this with another simulation, which will show the MSE-trained model (OLS) to have greater MSE when it comes to coefficient estimates than the MAE-trained model.

library(quantreg)
library(MASS)
library(ggplot2)
set.seed(2022)
N <- 100 
R <- 1000

b0 <- 3
b1 <- -1
b2 <- 2

M1 <- M2 <- matrix(NA, R, 3)  # Matrices for holding estimated coefficients

for (i in 1:R){

  # Define correlated predictors
  X <- MASS::mvrnorm(N, c(0, 0), matrix(c(1, 0.9, 0.9, 1), 2, 2))

  # Define response variable
  ye <- b0 + b1*X[, 1] + b2*X[, 2]
  e <- rt(N, 1.1)  # error term, t-distributed with heavy tails for outliers
  y <- ye + e

  # Define OLS and MAE regression models
  L1 <- lm(y ~ X[, 1] + X[, 2])
  L2 <- quantreg::rq(y ~ X[, 1] + X[, 2], tau = 0.5)

  # Save the coefficients to each model's respective matrix
  M1[i, ] <- summary(L1)$coef[, 1]
  M2[i, ] <- summary(L2)$coef[, 1]

  print(i)
}

# Evaluate all six coefficient MSE values
ols_0 <- mean((b0 - M1[, 1])^2)
ols_1 <- mean((b1 - M1[, 2])^2)
ols_2 <- mean((b2 - M1[, 3])^2)

mae_0 <- mean((b0 - M2[, 1])^2)
mae_1 <- mean((b1 - M2[, 2])^2)
mae_2 <- mean((b2 - M2[, 3])^2)

print(paste(
  "OLS has intercept MSE of ",
  ols_0  # I get ~51.2
))
print(paste(
  "Quantile regression has intercept MSE of",
  mae_0  # I get ~0.03
))

print(paste(
  "OLS has X1 MSE of ",
  ols_1  # I get ~203.1
))
print(paste(
  "Quantile regression has X1 MSE of",
  mae_1  # I get ~0.14
))

print(paste(
  "OLS has X2 MSE of ",
  ols_2  # I get ~188.2
))
print(paste(
  "Quantile regression has X2 MSE of",
  mae_2  # I get ~0.15
))

For all three parameters, the OLS regression has a much higher parameter estimate MSE than the MAE-optimizing regression (quantile regression at the median).

No matter what you do, give it context. As I discuss here, there is no universal metric that lets you "grade" a model like we all got or still get (or assign, for those readers who teach) grades in school.

EDIT

(I liked what I wrote in the comments, so I’m adding it to my answer.)

As you can see in my answer, the choice of training loss function does not guarantee a particular result when it comes to out-of-sample performance. Therefore, if you have a reason to be interested in a particular type of out-of-sample performance, go with the model that does the best on that metric. If that happens to be the model that uses the out-of-sample metric as its training loss function, so be it, but you hardly assure yourself of a particular model winning by deciding that out-of-sample MSE is the metric of interest. Perhaps think of the training loss as a hyperparameter.

In fact, when you tune a regularization hyperparameter to achieve the best out-of-sample performance, what you’re doing is exactly that: treating the training loss function as a hyperparameter you tune in order to achieve your goal of excellent out-of-sample performance, even at the expense of in-sample performance. I find it completely reasonable that this extends to markedly different loss functions like MSE vs MAE (as opposed to MSE vs MSE with the ridge regression penalty added), and my simulation shows that it can work the way I suspect it can.
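Here is a minimal sketch of that hyperparameter view (the 80/20 split, the toy data, and the two candidate losses are just illustrative choices):

# Treat the training loss as a hyperparameter: fit one model per candidate loss,
# then keep whichever wins on the held-out metric of interest (out-of-sample MSE here).
library(quantreg)
set.seed(2022)
x <- rnorm(100)
y <- 1 + 2*x + rt(100, 2)  # heavy-tailed noise for outliers
train <- 1:80
validate <- 81:100
fits <- list(
  square_loss   = lm(y ~ x, subset = train),
  absolute_loss = quantreg::rq(y ~ x, tau = 0.5, subset = train)
)
oos_mse <- sapply(fits, function(f)
  mean((y[validate] - predict(f, data.frame(x = x[validate])))^2)
)
print(oos_mse)
print(names(which.min(oos_mse)))  # the "tuned" training loss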

What your results show is that, when you pick MSE as the out-of-sample metric of interest, the best-performing model is the one that was trained with in-sample MSE. That’s fine. You didn’t guarantee that result by picking MSE as the out-of-sample metric; it just worked out that way.

EDIT 2

In our chat, you have expressed dissatisfaction with only showing my counterexample through a simulation, rather than with real data. However, it can happen with real data, too.

library(quantreg)
library(ModelMetrics)
set.seed(2022)

data(iris)
N <- dim(iris)[1]
idx <- sample(seq(1, N, 1), N, replace = F)
XY <- iris[idx, ]
x_test <- XY$Petal.Length[1:20]
y_test <- XY$Petal.Width[1:20]

x_train <- XY$Petal.Length[21:N]
y_train <- XY$Petal.Width[21:N]

L_ols <- lm(y_train ~ x_train)
L_mae <- quantreg::rq(y_train ~ x_train, tau = 0.5)

# Only the predictor is needed in newdata
pred_ols <- predict(L_ols, data.frame(x_train = x_test))
pred_mae <- predict(L_mae, data.frame(x_train = x_test))

ModelMetrics::mse(y_test, pred_ols) - ModelMetrics::mse(y_test, pred_mae)

I get that the OLS-based out-of-sample MSE is $0.00149052$ higher than the MAE-trained model's MSE, showing that training on square loss does not guarantee a winning model when it comes to out-of-sample square loss, even on real data. Therefore, when your colleagues argue that training on square loss guarantees a winner on out-of-sample square loss, they are wrong.

EDIT 3

If you run the same kind of regression but with Sepal instead of Petal, you get that the model trained with OLS outperforms (by $0.005632787$) the model trained on MAE when it comes to out-of-sample MAE!

library(quantreg)
library(ModelMetrics)
set.seed(2022)

data(iris)
N <- dim(iris)[1]
idx <- sample(seq(1, N, 1), N, replace = F)
XY <- iris[idx, ]
x_test <- XY$Sepal.Length[1:20]
y_test <- XY$Sepal.Width[1:20]

x_train <- XY$Sepal.Length[21:N]
y_train <- XY$Sepal.Width[21:N]

L_ols <- lm(y_train ~ x_train)
L_mae <- quantreg::rq(y_train ~ x_train, tau = 0.5)

# Only the predictor is needed in newdata
pred_ols <- predict(L_ols, data.frame(x_train = x_test))
pred_mae <- predict(L_mae, data.frame(x_train = x_test))

ModelMetrics::mae(y_test, pred_ols) - ModelMetrics::mae(y_test, pred_mae)

Now you have examples, using real (not simulated) data, of an MSE-trained model outperforming an MAE-trained model on out-of-sample MAE and of an MAE-trained model outperforming an MSE-trained model on out-of-sample MSE.

Dave

RMSE tells us how far the model residuals are from zero on average, i.e., the average distance between the observed and predicted values. However, Willmott et al. suggested that RMSE can be misleading for assessing model performance, since RMSE is a function of both the average error and the distribution of squared errors. Chai recommended using both RMSE and mean absolute error (MAE); it is better to report both metrics.

By the way, $R^2$ is misleading as well, since it increases with the number of predictors. I would recommend using adjusted $R^2$ instead; it is something of a gold-standard goodness-of-fit measure.

To handle multicollinearity, compute the nonparametric Spearman rank-order correlations and drop one variable from any pair whose correlation is close to 1. That will address the problem.

As for outliers, it is not good practice to delete them, because they may hold valuable insights. Find the influential outliers and fit the model with and without them to see their impact on the model. Also, don't forget to check all four properties of the linear model assumptions. The articles below give more explanation.
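A rough sketch of those checks in R (df and its response column y are placeholders for your data, and the 4/n Cook's distance cutoff is just a common rule of thumb):

fit <- lm(y ~ ., data = df)
summary(fit)$adj.r.squared  # adjusted R^2

# Spearman rank-order correlations among predictors; inspect pairs close to 1
cor(df[, setdiff(names(df), "y")], method = "spearman")

# Influential outliers: refit without high-influence points and compare coefficients
infl <- which(cooks.distance(fit) > 4 / nrow(df))  # rule-of-thumb cutoff
fit_wo <- lm(y ~ ., data = if (length(infl)) df[-infl, ] else df)
cbind(with_outliers = coef(fit), without_outliers = coef(fit_wo))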

Reference:

http://www.jstor.org/stable/24869236

https://gmd.copernicus.org/articles/7/1247/2014/

  • It is difficult to see how any of this answer is relevant to the stated objective, "I want to check how accurate the model is in estimating or predicting the coefficients." – whuber Mar 25 '22 at 16:09
  • @whuber I guess the OP's question was about model comparison. What is a better way to compare models? – ForestGump Mar 25 '22 at 16:11
  • This question is confusing because it appears to be about improving predictions in the face of outliers etc., but then it asks specifically about estimating coefficients. When a question is confusing, ambiguous, or appears misdirected, it is better to post a comment explaining what the difficulty is and asking for clarification. – whuber Mar 25 '22 at 16:33
  • @whuber, My question, in simple words, is: if I have a dataset and this dataset contains outliers and multicollinearity, then I applied three different regression models to that dataset. I want to investigate the accuracy of these regression models based on a performance measure that is not biased towards any particular regression model. After calculating this measure, I can say with confidence that this regression model is the best one even when the dataset contains outliers and multicollinearity. – jeza Mar 25 '22 at 17:06
  • @whuber, I cannot use mean square error or R^2 because ordinary square regression will be the best, and the same with median absolute deviation, since absolute linear regression will be the best. – jeza Mar 25 '22 at 17:13
  • @jeza I have never found any such claim in standard ML or classical regression texts. Could you please give us a reference for it? – ForestGump Mar 25 '22 at 17:22
  • This makes little sense to me, because why not just use a regression model that is appropriate for your application? There's no bias in that. In fact, it's how statistical procedures are supposed to be selected. – whuber Mar 25 '22 at 17:40
  • @ForestGump, It will be clear if you have an idea of 'Loss Functions and Optimization'. – jeza Mar 26 '22 at 02:39
  • @whuber, simply, because it is a comparison study. – jeza Mar 26 '22 at 02:42
  • It is difficult to conceive of how examining three procedures on just one dataset could be considered a "comparison study" of those procedures. A true comparison study would explore the results of those procedures on an array of datasets of known properties. – whuber Mar 26 '22 at 14:15
  • @whuber, I say one dataset solely for the sake of simplification. Over 30 databases and a hundred simulations are included in the project. Additionally, it includes more than ten regression models. – jeza Mar 26 '22 at 16:16
  • Consider the possibility that you might have oversimplified! – whuber Mar 26 '22 at 17:01
  • This answer is based on many of the same misconceptions that prompted me to downvote another answer of yours. //@jeza Just because a particular loss function is optimized by a particular model with in-sample data does not mean that such a loss function will optimize the model out-of-sample. It is possible that a model trained using MAE loss will have lower out-of-sample MSE than a model trained using MSE loss. – Dave Mar 28 '22 at 13:57
  • @Dave, I found that when the mean square error (MSE) performance measure is used for the regression comparisons, the OLS model is the best among others. However, if median absolute deviation (MAD) is used, absolute linear regression is the best. – jeza Mar 28 '22 at 23:01
  • That might turn out to be the case with out-of-sample data, but it is not a certainty. If you have a reason to like square or absolute loss, go for the model that gives the strongest performance on that metric of interest. – Dave Mar 28 '22 at 23:06
  • @jeza You can see in my answer that what you wrote need not apply to out-of-sample testing, where an MAE-trained model can outperform an MSE-trained model, even evaluated on (out-of-sample) MSE. – Dave Mar 31 '22 at 19:23
  • @Dave, You can see the update in my question. – jeza Apr 01 '22 at 00:06
  • What’s the problem? @jeza – Dave Apr 01 '22 at 00:22
  • @Dave, when MSE is used as the performance metric, OLS is the optimal choice; when MAD is used as the performance metric, linear absolute regression is the superior choice. So, MSE and MAD cannot be used as a performance metric to investigate the accuracy of two or more different regressions because there is a bias toward specific models. Thus, I ask for a universal measure of the accuracy of linear regression models. – jeza Apr 01 '22 at 09:38
  • As you can see in my answer, the choice of training loss function does not guarantee a particular result when it comes to out-of-sample performance. Therefore, if you have a reason to be interested in a particular type of out-of-sample performance, go with the model that does the best on that metric. If the model happens to be the model that uses the out-of-sample metric as the training loss function, so be it, but you hardly assure yourself of a particular model winning by deciding that out-of-sample MSE is the metric of interest. Perhaps think of the training loss as a hyperparameter. – Dave Apr 01 '22 at 09:45
  • In fact, when you tune a regularization hyperparameter to achieve the best out-of-sample performance, what you’re doing is exactly that: treating the training loss function as a hyperparameter you tune in order to achieve your goal of excellent out-of-sample performance, even at the expense of in-sample performance. I find it completely reasonable that this extends to markedly different loss functions like MSE vs MAE (as opposed to MSE vs MSE with the ridge regression penalty added), and my answer shows that it can work the way I suspect it can. @jeza – Dave Apr 01 '22 at 09:52
  • @Dave, as I showed in the update, I cannot use MSE or MAD to judge which model is the best (because using MSE will let the OLS model be the best, and MAD makes the results different and the best model will be absolute linear regression). Thus, it means that I need another way to evaluate the regression model's accuracy for this kind of comparison. What would you advise me? – jeza Apr 01 '22 at 15:19
  • I’d still advise you to pick one that makes sense for the objectives of your work. If that out-of-sample metric prefers the model trained in-sample with that same metric, so be it. My simulation shows that this does not have to happen, so you’re not guaranteeing an OLS-trained model by focusing on out-of-sample square loss. That just happens to be what worked out in your case. If you’ve decided that out-of-sample MSE is the interesting metric for your problem, who cares if the loss function used for training the winning model was square loss, absolute loss, ridge, or something else? @jeza – Dave Apr 01 '22 at 15:33