Can I remove sample outliers using standard deviation?

Question

I am looking to find clinical and other measurements that predict a blood metabolite, using Elastic-Net regression models.

Can I remove samples with values more than 1.96 SD from the mean as outliers? I read a post stating that if the samples are normal (e.g. disease-free), it could be fine.

The samples are collected from 10 different sites within the United States and then processed in the lab to test a specific cellular function, which gives a number indicating the efficiency of the process. To harmonize the data, I thought removing the outliers was important, but I am still not sure this is a good approach. I noticed that a non-linear model's performance improved significantly when removing samples more than 2 SD from the mean. And when only using values within 1 SD of the mean (which really cuts down the number of samples), mean squared error is reduced as well.

The sample size is approximately 6000.

The goal is to perform both non-linear regression and the linear repeated ElasticNet models.

This is a poor idea for many reasons. One is that when outliers truly are present, they inflate your estimated SD, making it more difficult to identify them. Thus, there are two questions here: (1) In what sense would removing outliers "harmonize" the data? (2) If you really need to identify outliers, how best to do it? The answer to (2) depends partly on (1) along with your expectations, including whether you need to estimate the number of outliers; whether they are high, low, or in either direction; and what your probability model for the data is. — whuber, May 16 '22 at 15:46
The paper you cite is for "an interlaboratory study in which each laboratory uses the defined method of analysis to analyze identical portions of homogeneous materials to assess the performance characteristics obtained for that method of analysis." That's not the type of "collaborative study" that you seem to have. Please edit your question to say more about the nature of the metabolite measurement, the number and nature of covariates for the regression, the number of samples, and the number of sites involved. There are better ways to evaluate and deal with outliers, depending on such details. — EdM, May 16 '22 at 16:11
@EdM - thank you for the suggestion. I just updated the post. — Molly_K, May 16 '22 at 17:49
You might want to consider the difference between trimming (excluding) the outliers and 'winsorizing'. See https://en.wikipedia.org/wiki/Winsorizing or elsewhere on the internet for details. — Stephen Wood, May 17 '22 at 17:21

Billy · Accepted Answer · 2022-05-16T20:18:14.220

Using a standard deviation-based threshold for "outlier" detection is generally not a good idea. I think there may be some confusion about what "normal data" means: it refers to the statistical distribution of the data, not a qualitative description of the population from which they derive. Your post suggests that being from a disease-free population was taken to mean the data were "normal" when, in fact, the normality discussed in the post you linked is that of the normal distribution in statistics.

Perhaps the link was not the correct one, as I also did not see anywhere in it a recommendation to omit cases as outliers when they fall beyond some number of standard deviations from the mean. This criterion doesn't make sense for outlier detection because the normal distribution itself produces values at those extremes. In other words, by the definition of the normal distribution, we expect ~5% of the sample data to fall more than 1.96 standard deviations from the mean. That does not make them outliers; it just makes them rarer "extremes" in the distribution. This is before considering the issue raised by @whuber, wherein the presence of outliers will inflate the standard deviation anyway.
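Both points are easy to check by simulation. The sketch below is only an illustration: the seed, the sample size of 6000 (borrowed from the question), and the 300 "bad" readings recorded as 15.0 are assumptions made up for the demonstration, and the MAD-based scale is shown simply as one common robust alternative to the SD.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the sketch is reproducible


def frac_beyond(values, k=1.96):
    """Fraction of values lying more than k sample SDs from the sample mean."""
    m = statistics.fmean(values)
    s = statistics.stdev(values)
    return sum(abs(v - m) > k * s for v in values) / len(values)


def mad_scale(values):
    """Median absolute deviation, rescaled to estimate the SD of normal data."""
    med = statistics.median(values)
    return 1.4826 * statistics.median(abs(v - med) for v in values)


# Clean, purely normal data: roughly 5% get flagged although nothing is wrong.
clean = [random.gauss(0.0, 1.0) for _ in range(6000)]
clean_rate = frac_beyond(clean)

# Hypothetical contamination: 300 gross errors, all recorded as 15.0.
contaminated = clean + [15.0] * 300

sd_clean = statistics.stdev(clean)
sd_contaminated = statistics.stdev(contaminated)

# How extreme a bad value looks, judged against the clean data
# versus against the contaminated sample it is part of.
z_vs_clean = (15.0 - statistics.fmean(clean)) / sd_clean
z_vs_contaminated = (15.0 - statistics.fmean(contaminated)) / sd_contaminated

print(f"flagged in clean normal data: {clean_rate:.1%}")
print(f"SD clean vs contaminated:     {sd_clean:.2f} vs {sd_contaminated:.2f}")
print(f"z of a bad value:             {z_vs_clean:.1f} vs {z_vs_contaminated:.1f}")
print(f"MAD scale (contaminated):     {mad_scale(contaminated):.2f}")
```

The bad values inflate the SD that is then used to judge them, so they look far less extreme than they are, while the median-based scale barely moves.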

Now, to the change in model performance you noted when omitting the "outliers." The general gist of linear regression models is to predict some kind of conditional mean (with some caveats with respect to simplification, obviously). When extreme cases are omitted, we are left with cases whose values all sit close to the center, with reduced variation. You mention that the MSE improves when omitting cases beyond a certain number of standard deviations, which is almost guaranteed because you are selectively omitting the cases that will have large deviations from the mean. Thinking about the equation for MSE, each residual is squared, so large residuals contribute disproportionately, and these very large residuals are more likely in cases where the raw data are far from the mean to begin with. The MSE is thus a biased indicator of model performance (in this case), and I'd recommend looking at things like predictive distribution plots to see whether the model actually makes realistic predictions of the data rather than just how large residuals are on average.
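A small simulation makes the mechanism visible. This is a sketch under invented assumptions (a single predictor, a true slope of 2, unit-variance noise, 6000 cases), not a reconstruction of your data: trimming on the outcome lowers the in-sample MSE, but the fitted slope is attenuated and the trimmed model predicts the full data worse.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the sketch is reproducible


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def mse(xs, ys, intercept, slope):
    """Mean squared error of the fitted line on the given data."""
    return statistics.fmean(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)
    )


# Hypothetical data-generating process: y = 2x + noise, with no outliers at all.
xs = [random.gauss(0.0, 1.0) for _ in range(6000)]
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]

# Fit on everything.
b0_full, b1_full = fit_line(xs, ys)
mse_full = mse(xs, ys, b0_full, b1_full)

# Keep only cases whose outcome is within 1 SD of the outcome mean.
my, sy = statistics.fmean(ys), statistics.stdev(ys)
kept = [(x, y) for x, y in zip(xs, ys) if abs(y - my) <= sy]
xs_trim = [x for x, _ in kept]
ys_trim = [y for _, y in kept]

b0_trim, b1_trim = fit_line(xs_trim, ys_trim)
mse_trim = mse(xs_trim, ys_trim, b0_trim, b1_trim)

# Score the trimmed-data model on the full data it is meant to describe.
mse_trim_on_all = mse(xs, ys, b0_trim, b1_trim)

print(f"cases kept after trimming:       {len(kept)} of {len(xs)}")
print(f"slope (full vs trimmed fit):     {b1_full:.2f} vs {b1_trim:.2f}")
print(f"in-sample MSE (full vs trimmed): {mse_full:.2f} vs {mse_trim:.2f}")
print(f"trimmed model scored on all data: {mse_trim_on_all:.2f}")
```

The lower MSE after trimming reflects the narrower range of the retained outcomes, not a better model.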

To the question of outliers, you may consider thinking about identifying influential cases on the model and formal outlier detection methods. There are many univariate and multivariate outlier tests, but the overall identification of outliers is sometimes questionable as it may be better to think about outliers as arising from unique data generation processes rather than providing irrelevant information about the model. When outliers represent clearly incorrect data (e.g., data entry error, experimenter issue, out-of-range value), then it is more justifiable to remove those observations. It sounds like you may be concerned specifically with outliers caused by differences in your sites. If that's the case, then you may transition to multilevel models where each site is a grouping variable that can have random intercepts and slopes. This gets back to, ultimately, choosing a model that reflects your beliefs about what is causing the data you've observed.

score 9 · Answer 2 · answered May 16 '22 at 20:55

The answer from @Billy (+1) gets to the critical points of the question you posed. These are just a few further thoughts on your modeling strategy that are too extensive to fit into comments.

First, from what you describe it's not clear what you will gain with elastic net. With 6000 cases and what seems to be an outcome that takes on continuous values, you have a lot of flexibility in fitting your model without the variable omission and coefficient penalization involved in elastic net. By usual rules of thumb for biomedical studies, you could evaluate 300 or more predictors in a regression model without much risk of overfitting the model (a case/predictor ratio of 20). If you have thousands of predictors, like with RNA sequencing (RNAseq) data, elastic net might make sense--depending on how you want to apply your model in the future.

Second, it's not clear what you mean precisely by a "non-linear model" in this context. Some models that appear to be non-linear, like fitting outcomes to polynomial functions of predictors, are still "linear models" insofar as the models are linear in the regression coefficients. Sometimes you need a truly non-linear model, but linear modeling can cover a remarkably wide range of applications. You can use regression splines to model predictors flexibly, do non-linear transformations of variables before linear regression (like the log transform often used for RNAseq data), or use generalized linear models to have a nonlinear mapping between a linear-model predictor function and outcome. Those are all still considered linear models in an important technical sense.
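To make "linear in the regression coefficients" concrete, compare three generic forms. The symbols are illustrative and not taken from the question: $x_i$ is a predictor, $y_i$ the outcome, $\beta$ the coefficients, $\varepsilon_i$ the error, and $g$ a link function.

$$
\begin{aligned}
y_i &= \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i
  && \text{(curved in } x \text{, but linear in } \beta \text{: a linear model)} \\
g\big(\mathrm{E}[y_i \mid x_i]\big) &= \beta_0 + \beta_1 x_i
  && \text{(generalized linear model: linear predictor, nonlinear link)} \\
y_i &= \beta_0 \, e^{\beta_1 x_i} + \varepsilon_i
  && \text{(not linear in } \beta_1 \text{: a truly non-linear model)}
\end{aligned}
$$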

Consider whether you really need a non-linear model for your application. If you can perform your "non-linear" modeling in the context of generalized linear models and you do need to use elastic net, standard tools allow you to do that together instead of separately.

Third, remember that extreme values aren't necessarily "outliers" if the values of the associated predictor variables are also appropriately extreme. What is of concern is when differences between the observed and the model-predicted values (the residuals) are large or vary systematically. You certainly don't want to be removing extreme values as "outliers" at an early stage of analysis unless you know the values to have some technical error.
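The distinction can be shown with two hand-planted records in simulated data. Everything here is an assumption for illustration: the true slope of 2, the noise level, the two planted points, and the 3-SD residual cutoff, which is only a crude screen and not a recommended outlier test.

```python
import random
import statistics

random.seed(2)  # arbitrary seed so the sketch is reproducible

# Hypothetical data: y = 2x + noise.
xs = [random.gauss(0.0, 1.0) for _ in range(6000)]
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]

# Two planted records:
consistent_extreme = (3.0, 6.0)  # extreme outcome, but exactly what x = 3 predicts
hidden_error = (0.0, 4.0)        # unremarkable outcome, but far from what x = 0 predicts
xs += [consistent_extreme[0], hidden_error[0]]
ys += [consistent_extreme[1], hidden_error[1]]

# Ordinary least squares fit of y on x.
mx, my = statistics.fmean(xs), statistics.fmean(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
intercept = my - slope * mx

sd_y = statistics.stdev(ys)
sd_res = statistics.stdev(y - (intercept + slope * x) for x, y in zip(xs, ys))


def flagged_by_outcome(y, k=1.96):
    """The SD rule from the question, applied to the raw outcome."""
    return abs(y - my) > k * sd_y


def flagged_by_residual(x, y, k=3.0):
    """Crude screen on the distance between observed and predicted values."""
    return abs(y - (intercept + slope * x)) > k * sd_res


for name, (x, y) in [("consistent extreme", consistent_extreme),
                     ("hidden error", hidden_error)]:
    print(f"{name}: outcome rule={flagged_by_outcome(y)}, "
          f"residual screen={flagged_by_residual(x, y)}")
```

The outcome-based rule discards the record the model explains well and keeps the one it explains badly.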

Fourth, do be sure to include your sites as predictors in the model. Even if the biochemical assays were all performed at the same central location, it's possible for differences among sites in sample handling, patient characteristics, etc. to be important in a way that requires some form of statistical control.
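As a rough illustration of why this matters, the sketch below simulates 10 sites whose baseline levels and predictor distributions both differ. All numbers are invented. Site is handled here with fixed site intercepts, implemented by centering the predictor and outcome within each site, which gives the same slope as adding site indicator variables to the regression.

```python
import random
import statistics
from collections import defaultdict

random.seed(3)  # arbitrary seed so the sketch is reproducible

TRUE_SLOPE = 1.0  # hypothetical effect of the predictor within every site

# Hypothetical data: each site shifts both the predictor and the outcome.
rows = []
for site in range(10):
    for _ in range(600):
        x = 0.3 * site + random.gauss(0.0, 1.0)
        y = TRUE_SLOPE * x + 0.5 * site + random.gauss(0.0, 1.0)
        rows.append((site, x, y))


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sum((x - mx) ** 2 for x in xs)


# Pooled fit that ignores site.
pooled_slope = ols_slope([x for _, x, _ in rows], [y for _, _, y in rows])

# Site-adjusted fit: subtract each site's own means before regressing.
by_site = defaultdict(list)
for site, x, y in rows:
    by_site[site].append((x, y))

xs_centered, ys_centered = [], []
for points in by_site.values():
    site_mx = statistics.fmean(x for x, _ in points)
    site_my = statistics.fmean(y for _, y in points)
    xs_centered.extend(x - site_mx for x, _ in points)
    ys_centered.extend(y - site_my for _, y in points)

adjusted_slope = ols_slope(xs_centered, ys_centered)

print(f"true slope:          {TRUE_SLOPE:.2f}")
print(f"pooled slope:        {pooled_slope:.2f}")
print(f"site-adjusted slope: {adjusted_slope:.2f}")
```

In this setup the pooled fit attributes site differences to the predictor, while adjusting for site recovers the within-site relationship. Removing "outliers" to harmonize sites does not address this.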

The search function on this site can lead you to much information about these issues. If you don't find an answer that helps with future questions, ask further focused questions. See this help page for ways to write questions that can help both you and other visitors to the site.
