Does this theory still hold when a data set is not normally distributed?
It depends on what you mean by "does this theory still hold," the nature of your data, and how strict you want to be in identifying outliers.
The frequently used rule you cite was designed to flag about 1% of normally distributed values as potential outliers. It will flag different percentages of values if your data follow different distributions.
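As a side check (mine, not part of the rule's original derivation), the exact fraction of a normal population falling beyond the 1.5 IQR fences can be computed directly; it comes out a bit under 1%:
# theoretical fraction of a normal distribution beyond the 1.5 IQR fences
iqr <- qnorm(0.75) - qnorm(0.25)        # IQR of a standard normal
fence <- qnorm(0.75) + 1.5 * iqr        # upper fence; the lower fence is symmetric
2 * pnorm(fence, lower.tail = FALSE)    # about 0.007, i.e. roughly 0.7%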
Here are some quick examples based on 1000 random draws from each of several well-known distributions: t with 1 degree of freedom (Cauchy), t with 2 degrees of freedom, standard normal, and standard lognormal.
set.seed(20220630)
t1vals <- rt(1000, 1)     # t with 1 df (Cauchy)
t2vals <- rt(1000, 2)     # t with 2 df
nvals <- rnorm(1000)      # standard normal
lnvals <- rlnorm(1000)    # standard lognormal
Do boxplot calculations for these samples.
# boxplot() draws the plot and invisibly returns the box statistics;
# add plot = FALSE if you only want the statistics
boxt1 <- boxplot(t1vals)
boxt2 <- boxplot(t2vals)
boxn <- boxplot(nvals)
boxln <- boxplot(lnvals)
The stats component of a boxplot object "contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker"; the first and last of those are the cutoffs for outliers as you defined them.* With 1000 values and a hoped-for 1% outlier rate, you would expect about 10.
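For instance, you can inspect that component directly for the normal sample:
boxn$stats
# rows: lower whisker end, lower hinge, median, upper hinge, upper whisker end
# (a 5 x 1 matrix here, since only one variable was plotted)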
Here's what you find by adding up how many values in each case are below the lower whisker or above the upper whisker:
Not bad for the normally distributed data, just 15 instead of 10:
sum(nvals < boxn$stats[1,1] | nvals > boxn$stats[5,1])
# [1] 15
But for the others (lognormal, 2-df t and 1-df t in order):
sum(lnvals < boxln$stats[1,1] | lnvals > boxln$stats[5,1])
# [1] 64
sum(t2vals < boxt2$stats[1,1] | t2vals > boxt2$stats[5,1])
# [1] 83
sum(t1vals < boxt1$stats[1,1] | t1vals > boxt1$stats[5,1])
# [1] 150
So instead of flagging about 1% of values as outliers with that rule, you flag roughly 6% to 15% with these distributions.** Yes, the 1-df t (Cauchy) distribution is notorious for extreme behavior, but lognormal data are common in practice.
The question thus comes down to how many true members of the distribution you want to flag as outliers, given the nature of your data.
*Well, pretty close. The "hinges" used to define the box of the boxplot, and thus the whiskers that extend to the most extreme values within 1.5 IQR of the box, aren't exactly at the first and third quartiles, but they are close. Compare things like summary(t1vals) for quartiles against the corresponding boxt1$stats[c(2,4),1] for hinges.
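For example, one quick comparison using the Cauchy sample above (fivenum() gives Tukey's hinges, which is what boxplot() stores):
summary(t1vals)[c(2, 5)]    # 1st and 3rd quartiles
fivenum(t1vals)[c(2, 4)]    # lower and upper hinges (Tukey's five-number summary)
boxt1$stats[c(2, 4), 1]     # the same hinges, as stored in the boxplot object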
**You shouldn't trust single sets of samples like this. Try repeating these types of calculations multiple times to gauge their reliability. Don't forget to set a random seed for reproducibility.
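For example, here is one way to repeat the lognormal calculation many times; this is just a sketch, and the 500 repetitions and the new seed are arbitrary choices, not part of the analysis above:
set.seed(1)
flagged <- replicate(500, {
  x <- rlnorm(1000)
  s <- boxplot.stats(x)$stats    # lower whisker, lower hinge, median, upper hinge, upper whisker
  mean(x < s[1] | x > s[5])      # proportion flagged as outliers
})
summary(flagged)                 # spread of flagged proportions across repetitions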