6

I am analyzing some data and want to look at one particular point and see how "extreme" it is.

Do I exclude this outlier from the data, calculate the dataset's standard deviation and average, then compare my outlier to THAT, or do I calculate the standard deviation and average WITH the outlier included and THEN analyze my outlier with respect to those metrics?

whuber
  • 322,774

3 Answers

4

Actually, neither: compute how extreme that point is with respect to a robust estimator of location $l_x$, using a robust estimator of scale $s_x$. In essence, if your original point was an outlier, you will essentially be ignoring it in the computation of $(l_x,s_x)$. If your original point was not an outlier, it will have a negligible influence on $(l_x,s_x)$. Here is an article that will help you think clearly about this problem.
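As a rough illustration of this idea, the sketch below scores a point against the median (as $l_x$) and the scaled median absolute deviation (as $s_x$). Both choices are assumptions made for the example, since the answer does not name specific estimators, and the data values and the function name `robust_z` are made up.

```python
import statistics

def robust_z(values, x):
    """Score x against the median and scaled MAD of values (illustrative helper)."""
    med = statistics.median(values)
    # Median absolute deviation, scaled by 1.4826 so that it estimates the
    # standard deviation when the bulk of the data is roughly normal.
    mad = 1.4826 * statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero: at least half of the values are identical")
    return (x - med) / mad

# Made-up sample: six ordinary values and one suspicious point.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]
suspect = 25.0

score_with = robust_z(data, suspect)          # suspect left in the data
score_without = robust_z(data[:-1], suspect)  # suspect removed first

print(score_with, score_without)
```

On this sample the two scores come out the same up to rounding, which is the point of the robust approach: whether the suspect value is left in or taken out barely matters.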

user603
  • 22,585
  • Does it make sense to remove the outlier and then compute some statistics for the resulting distribution and then look at the outlier you removed and say, "This outlier, relative to the rest of the data, is X standard deviations above the mean" and have this be a fair comparison to make? I am basically trying to show how extreme this outlier is relative to the rest of the data. – MyNameIsKhan Aug 02 '12 at 17:02
  • 1
    @Against, that's sort of the way to do it: summarizing the distribution with a putative outlier removed works just fine when there is only one outlier. Life gets considerably more complicated when there might be more than one outlier, because the presence of one can "mask" the others. This is why the methods suggested by user603 are appealing: they do not require you to determine beforehand how many outliers there might be. – whuber Aug 09 '12 at 18:57
1

Before you get too comfortable with removing "outliers" you might want to look at the outliers dataset in the TeachingDemos package for R and work through the examples on the help page.

It would be good to look through more discussion on the topic; one place to start is Wikipedia. It also includes some of the other methods of looking at outliers.

Also think about what ammunition you are giving to critics of your results if you remove outliers.

Greg Snow
  • 51,722
  • You're right, but after re-reading the question several times I still do not see that it suggests removing the outliers altogether. (I have given it a new title to help make its intent clear.) – whuber Aug 09 '12 at 18:54
0

Here's some advice available on the web, from http://www.autobox.com/cms/index.php/blog, a software site that focuses on this subject. I am involved in software development for this site.

Why don't simple outlier methods work? The argument against our competition.

For a couple of reasons:

It wasn't an outlier. It was a seasonal pulse.

The observations outside of the 2 or 3 sigma bounds could in fact be a newly formed seasonal pattern. For example, halfway through the time series, Junes become very high when they had previously been average. Simple approaches would just remove anything outside the bounds, which could be throwing the "baby out with the bathwater".

Your 3 sigma calculation was skewed due to the outlier itself.

It is a chicken and egg dilemma. The outliers make the sigma wide so that you miss outliers.
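A small sketch of this effect, using made-up numbers: the same suspect point is scored against the classical mean and standard deviation twice, once with the point included in the reference data and once with it excluded. The helper name `z_score` and the values are illustrative assumptions.

```python
import statistics

def z_score(reference, x):
    """Classical z-score of x against the mean and sample SD of reference."""
    return (x - statistics.mean(reference)) / statistics.stdev(reference)

# Made-up values: eight ordinary observations and one suspect point.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
suspect = 25.0

z_included = z_score(clean + [suspect], suspect)  # sigma inflated by the suspect
z_excluded = z_score(clean, suspect)              # sigma from the other points only

print(z_included, z_excluded)
```

With the suspect point included, its z-score stays below 3, so a 3 sigma rule would not flag it; with it excluded, the score is far larger. In a sample of $n$ points scored against their own mean and sample standard deviation, no z-score can exceed $(n-1)/\sqrt{n}$, which is about 2.67 for $n = 9$.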

The outlier was in fact a promotion.

Using just the history of the series is not enough. You should include causals as they can help explain what is perceived to be an outlier.

Now let's consider the inlier.

There could be outliers that are within 3 sigma; let's say the observation is near the mean. When could a value near the mean be unusual? When the observation should have been high and, for some reason, it wasn't.

Simple methods force the user to specify the number of times the system should iterate to remove outliers.

You are then asked how many times you want the forecasting tool to iterate to find the interventions. Is this intelligence or a crutch? You are somehow supposed to provide empirically based guidance, but you don't know the answer, so it would be just a guess.

The reality is that simple methods and software assume a "mean model" to determine the outliers. The correct way is to build a model and identify the outliers at the same time. Sounds simple, right?

Does anyone have other examples of bad outlier methodologies, or examples posted by other software vendors?

IrishStat
  • 29,661