Deleting outliers prior to data splitting or only in the training set?

Question

I'm working on a dataset with some outliers in the response variable that are genuine observations (not errors). I want to calibrate a model that can then be used to predict on populations outside the training dataset. To assess its performance, I split my dataset into training and test sets at a 0.85 ratio.

My question is whether the outliers should be removed before splitting the data, or afterwards and only from the training set. I want to delete my outliers because they lower the performance of my model for what I would call common individuals.

It sounds like these outliers are outliers only in the response variable but not in any of the predictor features. I wonder if you're justified in removing those at all, since you have no way of knowing who the outliers are at the time of prediction. Basically, if you train a model this way, it is not going to represent what happens when you apply it "in the wild", since you can't avoid trying to predict people with outlier response variables (if you knew who those people were, you wouldn't need to predict the response variable in the first place). — Nuclear Hoagie, Aug 11 '23 at 13:08
You shouldn't have so many outliers that they would materially change the split, whichever way you do it. If you have a large number of outliers, then you need to ask yourself whether they are true outliers and not a shortcoming of your model, which can't handle extreme but legitimate observations — Aksakal, Aug 11 '23 at 13:13
Note that if the response variable is binary, the sample size often needs to exceed 20,000 before data splitting is reliable (as opposed to resampling methods). — Frank Harrell, Aug 11 '23 at 13:40
My response variable is continuous, thanks for your answer! — Renaud Bied-charreton, Aug 11 '23 at 13:51
@Aksakal This is exactly the point. Since my model is struggling to predict these high values, why not just accept that the model will be tremendously bad in these situations? — Renaud Bied-charreton, Aug 11 '23 at 13:56
In this case, don't call them outliers; leave them in the dataset so that the model error reflects the quality of its fit — Aksakal, Aug 11 '23 at 22:04

score 16 · Answer 1 · answered Aug 11 '23 at 13:15

As far as I can see, you shouldn't remove them at all. This is generally the case. Don't remove outliers just to improve your model. You might want to look at robust regression methods, but you might not.
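To give an idea of what a robust regression method does, here is a minimal sketch of a Huber M-estimator fitted by iteratively reweighted least squares, compared with ordinary least squares. Everything in it is an assumption made for illustration: the simulated data, the 5% contamination rate, the tuning constant k = 1.345, and the helper names `fit_ols` and `fit_huber`. It is not the asker's data or model.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares; X must already contain an intercept column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def fit_huber(X, y, k=1.345, n_iter=50):
    """Huber M-estimator via iteratively reweighted least squares (IRLS)."""
    coef = fit_ols(X, y)
    for _ in range(n_iter):
        resid = y - X @ coef
        # Robust scale estimate from the median absolute deviation (MAD).
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        # Weight 1 for small residuals, k*scale/|r| for large ones.
        abs_r = np.maximum(np.abs(resid), 1e-12)
        w = np.minimum(1.0, k * scale / abs_r)
        # Weighted least squares: scale each row by sqrt(weight).
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

# Simulated data (assumption): true intercept 1, true slope 2.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

# Assumed contamination: 5% of responses get a large positive shift that is
# unrelated to x, so nothing in the predictor identifies these rows.
is_extreme = rng.random(n) < 0.05
y[is_extreme] += rng.normal(30, 5, is_extreme.sum())

X = np.column_stack([np.ones(n), x])
ols_coef = fit_ols(X, y)
huber_coef = fit_huber(X, y)

print("true  [intercept, slope]:", [1.0, 2.0])
print("OLS   [intercept, slope]:", ols_coef)
print("Huber [intercept, slope]:", huber_coef)
```

Note that the extreme rows are never deleted here. The robust fit only limits how much each of them can pull the coefficients, and the errors on those rows remain large and visible.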

You wrote:

I want to delete my outliers because they lower the performance of my model for what I would call common individuals.

But, no. That's cheating. You don't know, in advance, who these "common individuals" are, because the outliers are only on the response. If they were outliers on the independent variables, you might (MIGHT) be somewhat on the right path here (although I still tend to think not). You might then say "This model does not work well for people who weigh over 150 kg" or something like that. I think this is not a good way to go, but ... it sort of has some justification.

But you have outliers on the response. So, you are saying "don't use this method when it won't work. And I don't know when it won't work".

That's not just statistical cheating, it also makes your method pretty useless.
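The point can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: simulated data, 85% of rows used for training, extreme responses placed independently of the predictor, and a Tukey upper fence (Q3 + 1.5 IQR) as the hypothetical outlier rule. It fits a line on a trimmed training set and then scores it two ways: on the full test set, and on a test set with the extreme responses removed.

```python
import numpy as np

# Simulated data (assumption): 5% of responses are extreme, and the
# predictor x carries no information about which rows those are.
rng = np.random.default_rng(42)
n = 2000
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
is_extreme = rng.random(n) < 0.05
y[is_extreme] += rng.normal(30, 5, is_extreme.sum())

# Split first (85% train, 15% test), before any filtering.
idx = rng.permutation(n)
n_train = int(0.85 * n)
train, test = idx[:n_train], idx[n_train:]

# Hypothetical outlier rule: Tukey upper fence on the response,
# computed from the training responses only.
q1, q3 = np.percentile(y[train], [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Fit a straight line on the trimmed training set.
keep_train = y[train] <= upper_fence
X_train = np.column_stack([np.ones(keep_train.sum()), x[train][keep_train]])
coef, *_ = np.linalg.lstsq(X_train, y[train][keep_train], rcond=None)

def rmse(x_eval, y_eval):
    """Root mean squared error of the fitted line on the given rows."""
    pred = coef[0] + coef[1] * x_eval
    return float(np.sqrt(np.mean((y_eval - pred) ** 2)))

# Score on the full test set (what deployment looks like) and on a
# test set trimmed with the same fence (which needs the response).
keep_test = y[test] <= upper_fence
rmse_full_test = rmse(x[test], y[test])
rmse_trimmed_test = rmse(x[test][keep_test], y[test][keep_test])

print("RMSE on full test set:   ", rmse_full_test)
print("RMSE on trimmed test set:", rmse_trimmed_test)
```

The trimmed score can only be computed because the test responses are known. At prediction time they are not, so the full-test figure is the one that describes how the model would behave in use.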

score 9 · Accepted Answer · answered Aug 11 '23 at 15:02

You can check out my answer here where I give a worked example with a dataset in R in a related way. The short answer to your query, like Peter already mentioned, is that you shouldn't delete data points like these. Here are some examples of how linear relationships can be disrupted by "outliers". None of my ideas or terms used below are new, just ways I explain them:

You observe the average ages and batting averages of baseball players in the MLB. Generally speaking, age is negatively correlated with batting average in this data. But then you find that one player is quite old and yet has an outstanding batting average. I call these types of outliers "informative outsiders", in that they may provide a lot of information about superstar athletes or those who have peculiar characteristics not already observed in the rest of your dataset. They become particularly useful when you build larger and more varied datasets, where you may find more of these people in the future.
Another dataset may involve about 50 recordings of volcanic eruptions that generally show a linear pattern. But you notice that a few eruptions show an upward curvilinear trend. This is something I would consider a hidden trend. Here volcanic eruptions may actually exhibit an exponential relationship rather than a straight linear relationship, in which case your linear regression as a whole may be more poorly specified than you think. Treating these points as outliers would not be advisable.
You have a dataset containing several productivity patterns for employees of a Fortune 500 company. You notice somebody has 5,000,000 logged hours for the week while most only have around 40. This data should raise a lot more red flags, as it is most likely a data entry error and not possible within the bounds of your data (how can anybody log 5,000,000 hours in a week?). However, where do you draw the line? If we scaled that number down by a lot (let's say to 100 hours), it is certainly still much larger than the others, but is it an unrealistic value?
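The hidden-trend case in the volcano example can be sketched in a few lines. The data below are simulated and purely hypothetical (50 points, an assumed growth rate of 0.08, 10% multiplicative noise), and the variable names are my own. The sketch compares a straight-line fit with a log-linear fit and shows where the straight line misses most.

```python
import numpy as np

# Simulated data (assumption): an exponential trend with multiplicative noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 50, 50)
y = np.exp(0.08 * x) * rng.lognormal(mean=0.0, sigma=0.1, size=x.size)

# Straight-line fit on the original scale.
slope, intercept = np.polyfit(x, y, 1)
linear_pred = intercept + slope * x

# Log-linear fit: a straight line in log(y), i.e. an exponential in y.
log_slope, log_intercept = np.polyfit(x, np.log(y), 1)
exp_pred = np.exp(log_intercept + log_slope * x)

def rmse(pred):
    """Root mean squared error on the original scale of y."""
    return float(np.sqrt(np.mean((y - pred) ** 2)))

rmse_linear = rmse(linear_pred)
rmse_exp = rmse(exp_pred)

# The three points the straight line under-predicts the most.
worst = np.argsort(y - linear_pred)[-3:]

print("RMSE, straight line:", rmse_linear)
print("RMSE, exponential:  ", rmse_exp)
print("x of largest positive residuals under the straight line:", x[worst])
```

Under the straight-line model, the points with the largest residuals look like outliers. Under the exponential model the same points fit the overall pattern, which is the sense in which the "outliers" were a specification problem all along.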

You may gather from these examples that subject matter expertise in your domain plays a special role in making decisions about "outliers." Most of the time, such data is better left in, with the effort going into shaping an accurate model around it. In your case, the data does not look bizarre enough to warrant concern on that end, so I would leave it in as Peter already advised.
