How to compare influence of outlier in regression model. ANOVA of two models in R

Question

I am doing linear regression in R. I have identified an outlier in my data:

outliers::grubbs.test(all_pav$fmd_perc)
Grubbs test for one outlier
data:  all_pav$fmd_perc
G = 4.42003, U = 0.75274, p-value = 9.442e-05
alternative hypothesis: highest value 43.6823104693141 is an outlier

I can see this outlier in my model plots:

library(ggfortify)  # assumed: supplies the autoplot() method for lm objects
moda <- lm(fmd_perc ~ egfr_cr_cys + tacpd + qrisk3 + homa_ir,
           data = all_pav, na.action = na.omit)
autoplot(moda)

I then create a new model with the outlier removed from the data. I intended to compare the adjusted $R^2$ value between the two models, and then perform an ANOVA comparing the two models. That second step isn't possible because the two models have different numbers of observations, and in R I get an error:

anova(moda, new_moda)
Error in anova.lmlist(object, ...) : 
  models were not all fitted to the same size of dataset

Is my approach correct for examining whether an outlier is influential?

An outlier test on your dependent variable doesn't make much sense because you assume that this variable is conditional on the predictor variables. If you must do a hypothesis test (you really shouldn't), do it with the model residuals. — Roland, Nov 15 '23 at 07:50

score 5 · Accepted Answer · answered Nov 15 '23 at 00:13

The ANOVA test is not appropriate here. It compares nested models of different complexity (number of parameters) fitted to the same data, so removing an outlier from one of the fits is not allowed.

"Examining whether the outlier is influential" is pointless in the sense that the outlier will certainly have an influence; what is less clear is how much of a problem that is. You should try to find out whether the outlier is an erroneous observation or reliable information. In the latter case you should be interested in that information rather than discarding it. However, even in the latter case the influence of the outlier on the analysis may be problematic, as it may give a misleading picture of what goes on in the vast majority of the data.

You can find out how the influence plays out by running a regression with and without the outlier and comparing the results (looking at the full set of results rather than hoping for a single test to give you a binary assessment of little informative value).
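As a minimal sketch of that with/without comparison, the following uses simulated stand-in data rather than the asker's `all_pav`. The names `dat`, `x`, `y`, `fit_all`, and `fit_drop` are hypothetical, and the planted outlier is an assumption made for illustration.

```r
# Simulated stand-in data (hypothetical; not the asker's all_pav)
set.seed(1)
n <- 50
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
dat$x[n] <- 3    # plant one observation with high leverage ...
dat$y[n] <- 25   # ... and a response far from the bulk of the data

# Fit the same model with and without the suspect observation
fit_all  <- lm(y ~ x, data = dat)
fit_drop <- lm(y ~ x, data = dat[-n, ])

# Coefficients side by side, plus the change caused by dropping the point
comparison <- cbind(all     = coef(fit_all),
                    dropped = coef(fit_drop),
                    change  = coef(fit_drop) - coef(fit_all))
print(round(comparison, 3))

# Adjusted R^2 and residual SD, reported descriptively rather than tested
fit_stats <- sapply(list(all = fit_all, dropped = fit_drop), function(f) {
  s <- summary(f)
  c(adj_r2 = s$adj.r.squared, sigma = s$sigma)
})
print(round(fit_stats, 3))
```

The two tables are meant to be read together: how far the estimates move, and whether the substantive conclusions change, matters more than any single number.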

I would in any case recommend using robust regression methods as implemented in the R-package robustbase (lmrob command and associated diagnostic plots). This will deliver less affected regression estimates without the need to remove outliers beforehand (which is always questionable if they are not clearly erroneous). Also, diagnostic plots derived from robust regression are more reliable than those derived from the LS-estimator, as the latter may themselves be affected by further outliers that are potentially not seen.
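A hedged sketch of what this could look like with `lmrob`, again on simulated stand-in data (the names `dat`, `fit_ls`, `fit_rob`, and `rob_w`, and the 0.1 cut-off, are hypothetical). It assumes the robustbase package is installed.

```r
library(robustbase)  # assumed to be installed

# Simulated stand-in data with one planted outlier (hypothetical)
set.seed(1)
n <- 50
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
dat$x[n] <- 3
dat$y[n] <- 25

fit_ls  <- lm(y ~ x, data = dat)      # least squares, for reference
fit_rob <- lmrob(y ~ x, data = dat)   # robust fit with lmrob's default settings

# Compare the two sets of estimates
print(round(cbind(ls = coef(fit_ls), robust = coef(fit_rob)), 3))

# Robustness weights close to 0 mark observations the robust fit downweights
rob_w <- weights(fit_rob, type = "robustness")
print(which(rob_w < 0.1))

# plot(fit_rob)  # robust diagnostic plots; uncomment in an interactive session
```

No observation is removed by hand here; the robustness weights show which points the estimator itself treated as outlying.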

This is really helpful. I've read a bit about robust regression. Are there factors that determine whether the Tukey or the Huber method is more appropriate? — Mark Davies, Nov 15 '23 at 06:00
@MarkDavies These days the best established robust regression is the MM-estimator by Yohai, which improves over what Tukey and Huber originally proposed regarding the efficiency vs. robustness against leverage points trade-off. Note that the terms "Huber method" and "Tukey method" are not really well defined as both Huber and Tukey tried out many things; furthermore it may refer to a principle of estimation but also to a specific $\rho$-function. The MM-estimator in lmrob as a default uses a $\rho$-function based on a proposal by Tukey, but I don't think there's an optimality result for that. — Christian Hennig, Nov 15 '23 at 10:14

Shawn Hemelstrand · Answer 2 · 2023-11-15T03:48:13.567

Outliers and Methods

I will echo Christian's sentiments here (I've discussed this here and here). You need to determine if this is an erroneous value or simply a value that is higher/lower than the rest of the distribution. My previous answers discuss that in more detail, so I won't repeat that again, but will summarize that simple removal of a "real" value isn't usually a great option. Chapter 10 of Cohen et al., 2003 provides a very in-depth discussion on outlier detection and treatment, where some suggestions (such as Christian's) are listed as possible solutions (starting around Page 415).

I wanted to add that another alternative in addition to Christian's suggestion is running a quantile regression using the conditional median rather than the conditional mean. The median is less affected by outliers and may paint a more accurate picture of the relationship if the regression seems to not be capturing what you determine to be the real data generating process.
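A minimal sketch of a median regression, assuming the quantreg package is installed. The data are simulated stand-ins with one outlier planted in the response only; the names `dat`, `fit_ls`, and `fit_med` are hypothetical.

```r
library(quantreg)  # assumed to be installed

# Simulated stand-in data with one outlier in the response (hypothetical)
set.seed(1)
n <- 50
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
dat$y[n] <- 25

fit_ls  <- lm(y ~ x, data = dat)              # conditional mean
fit_med <- rq(y ~ x, tau = 0.5, data = dat)   # conditional median

# Compare mean-regression and median-regression coefficients
print(round(cbind(ls = coef(fit_ls), median = coef(fit_med)), 3))
```

Note that median regression mainly protects against outliers in the response; an observation with an unusual predictor value (a leverage point) can still affect the fit.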

For some useful articles on the subject of quantile regression, see the references below.

References

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed). L. Erlbaum Associates.
Koenker, R., & Hallock, K. F. (2001). Quantile regression. Journal of Economic Perspectives, 15(4), 143–156. https://doi.org/10.1257/jep.15.4.143
Waldmann, E. (2018). Quantile regression: A short story on how and why. Statistical Modelling, 18(3–4), 203–218. https://doi.org/10.1177/1471082X18759142
Wenz, S. E. (2019). What quantile regression does and doesn’t do: A commentary on Petscher and Logan (2014). Child Development, 90(4), 1442–1452. https://doi.org/10.1111/cdev.13141

score 2 · Answer 3 · answered Nov 15 '23 at 11:40

From your question, it looks like you are searching for a tool that tells you whether one outlier is "pulling the effect". This is what your fourth diagnostic plot, standardized residuals vs. leverage, addresses. If you use the check_model function of the performance package (see example here), you will have a yes/no answer to your question.
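The quantities behind the residuals vs. leverage plot can also be inspected directly in base R. The sketch below uses simulated stand-in data (hypothetical names `dat`, `fit`, `cd`, `lev`, `flagged`); the `4 / n` cut-off for Cook's distance is a common rule of thumb assumed here for illustration, not a formal test.

```r
# Simulated stand-in data with one planted influential point (hypothetical)
set.seed(1)
n <- 50
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
dat$x[n] <- 3
dat$y[n] <- 25

fit <- lm(y ~ x, data = dat)

cd  <- cooks.distance(fit)   # influence of each observation on the fitted values
lev <- hatvalues(fit)        # leverage of each observation

# Rule-of-thumb flag (assumption): Cook's distance above 4 / n
flagged <- which(cd > 4 / n)
print(round(cbind(cooks_d = cd, leverage = lev)[flagged, , drop = FALSE], 3))

# plot(fit, which = 5)  # base R residuals vs. leverage plot; uncomment to view
```

A flagged point is a prompt to look at that observation more closely, not a verdict that it should be removed.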

The DHARMa package would also give you a simple yes/no answer as to whether your model assumptions are met (overdispersion and/or heterogeneous variances between treatment levels) by calling plot() on the output of simulateResiduals().

Also, as said above, you cannot directly compare 2 models fitted to a different number of observations.
