
I was writing an R Markdown document for calculating measures of spread and was surprised to find that the IQR function does not use the IQR calculation I am used to, which is:

$$ \text{IQR} = \text{Q3} - \text{Q1} $$

So with a vector like this:

$$ x = \begin{bmatrix} 10,33,50,341,987,2006,2008 \end{bmatrix} $$

Q3 should be $2006$ and Q1 should be $33$. Therefore:

$$ \text{IQR} = 2006 - 33 = 1973 $$
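For reference, here is a minimal R sketch of that hand calculation (using one common textbook convention that yields those values: split the sorted data at the median, excluding it, and take the median of each half; this assumes an odd number of distinct values):

x7 <- c(10, 33, 50, 341, 987, 2006, 2008)
q1 <- median(x7[x7 < median(x7)])  # median of the lower half: 33
q3 <- median(x7[x7 > median(x7)])  # median of the upper half: 2006
q3 - q1                            # 1973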

However, when running this in R, this is not how the IQR() function appears to derive it:

x7 <- c(10,33,50,341,987,2006,2008)
IQR(x7)

Which returns this:

[1] 1455

Looking through the help page, it gives this brief explanation:

Note that this function computes the quartiles using the quantile function rather than following Tukey's recommendations, i.e., IQR(x) = quantile(x, 3/4) - quantile(x, 1/4).

For normally N(m,1) distributed X, the expected value of IQR(X) is 2*qnorm(3/4) = 1.3490, i.e., for a normal-consistent estimate of the standard deviation, use IQR(x) / 1.349.

My question is: why would this method be the preferred default if a lot of data is not normally distributed?

  • There are many common definitions of sample quartiles. However, Tukey didn't call the things you difference to get what you're calling "Tukey's IQR" quartiles (he called the things he defined "hinges", and he didn't call their difference by the name IQR either, he called the difference the H-spread). Tukey's hinges are available via the five number summary in R using the function fivenum. – Glen_b Feb 05 '23 at 08:11
  • Thanks a lot Glen. Pretty interesting how deep this subject is. I never learned anything other than the method I mentioned above, but I think it would be helpful for more people to know. I'll play around with fivenum today. – Shawn Hemelstrand Feb 05 '23 at 08:15
  • hspread<-function(x) {f<-fivenum(x);f[4]-f[2]} – Glen_b Feb 05 '23 at 08:24 (a sketch applying this to the question's vector follows these comments)
  • For continuous distributions with everywhere positive density such as the normal, all methods to compute quantiles should give the same result for sample size going to infinity (namely the uniquely defined distribution quantiles), so normality won't make a difference between the different ways of computing quantiles. The problems/different approaches to define quantiles in case of discrete real data have nothing to do with normality or non-normality. – Christian Hennig Feb 05 '23 at 12:47
  • Good to know. Seems everybody here is in agreement that normality isn't the sticking point. – Shawn Hemelstrand Feb 05 '23 at 13:42
  • Very closely related: https://stats.stackexchange.com/questions/134229. – whuber Aug 10 '23 at 16:38
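Following up on Glen_b's comments, here is a small sketch applying fivenum() and the hspread() helper from the comment above to the question's vector (output values shown as comments):

x7 <- c(10, 33, 50, 341, 987, 2006, 2008)
fivenum(x7)  # 10.0  41.5  341.0  1496.5  2008.0  (min, lower hinge, median, upper hinge, max)

hspread <- function(x) { f <- fivenum(x); f[4] - f[2] }
hspread(x7)  # 1455 -- Tukey's H-spread; for this particular vector it matches IQR(x7)'s default result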

2 Answers


Appropriate methods for computing quantiles are a deep rabbit hole. quantile() provides 9 separate methods for computing a quantile; of these, 4 give the Q1 of 33 that you're expecting:

x <- c(10, 33, 50, 341, 987, 2006, 2008)  # the question's vector
sapply(1:9, \(t) quantile(x, 0.25, type = t)) |> setNames(1:9)
       1        2        3        4        5        6        7        8 
33.00000 33.00000 33.00000 27.25000 37.25000 33.00000 41.50000 35.83333 
       9 
36.18750

The type argument can also be passed to IQR():

sapply(1:9, \(t) IQR(x, type = t))
[1] 1973.000 1973.000  954.000 1214.500 1714.000 1973.000 1455.000 1800.333
[9] 1778.750

Only 3/9 (types 1, 2, 6) give your result. (Type 7 is R's default.)
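So if you want IQR() to reproduce the hand-calculated value, you can pass one of those types explicitly (the numbers below are read off the table above):

IQR(x, type = 2)  # 1973 -- matches the Q3 - Q1 value computed by hand
IQR(x, type = 7)  # 1455 -- the default, i.e. what IQR(x7) returned in the question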

?quantile (and especially Hyndman and Fan 1996, referenced therein) gives much more detail about the different definitions.

PS I don't think Normality has anything to do with it ...

While I'm at it, I would be interested if anyone could clarify (in a separate answer or in comments) what "Tukey's recommendations" are in the help for ?IQR: Exploratory Data Analysis is available for checkout at the Internet Archive, but I haven't yet located the precise advice that the R documentation is referring to ...

Ben Bolker
  • Thank you that is very useful. May spend some time today reading that article. – Shawn Hemelstrand Feb 04 '23 at 05:19
  • +1 I suspect the R documentation is referring to Tukey's letter statistics, where the analog of the IQR is the "H-spread" between the two "hinges" (aka "fourths"). The hinges are not quite the quartiles, as you likely know. I discuss this at https://stats.stackexchange.com/a/134239/919. – whuber Feb 04 '23 at 15:57

Ben Bolker's answer explains why the quantile() function, and thus the IQR, may not return the values you expect. However, I think you've misunderstood the documentation about normality.

Only the "Type 9" method for quantile and IQR is intended for normally-distributed data, where it provides a median-unbiased estimate (docs here). The default is "Type 7".

Instead, the IQR documentation is hinting at a method for robustly estimating the standard deviation of samples from a normal distribution. You might be tempted to do so by plugging the values directly into the (Bessel-corrected) definition of the standard deviation: $$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2} $$

However, this can go badly wrong with noisy data. The mean $\bar{x}$ is extremely sensitive to outliers. In fact, even a single outlier is enough to make $\bar{x}$ take on any value.
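A toy illustration with made-up numbers:

x <- c(1, 2, 3, 4, 5)
mean(x)            # 3
mean(c(x, 1e6))    # ~166669 -- one wild value drags the mean almost anywhere
median(c(x, 1e6))  # 3.5 -- the median barely moves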

Robust methods can provide more sensible estimates of the SD. One popular way to do so for normally distributed data is to calculate the IQR and divide it by 1.349. This trick exploits the fact that, for a standard normal distribution, the 1st and 3rd quartiles occur at $\approx \pm 0.67449$. Thus, if you know the data is normal(ish), you can divide its IQR by $0.67449 \times 2 = 1.349$ and approximate the SD.

There are other robust methods based around the median absolute deviation or pairwise differences. These are useful because the order statistics are much more resilient: changing one value has almost no effect on the median, and you would need to change more than 50% of the values to move it to an arbitrary point.
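Here is a minimal sketch of both ideas on simulated data (the seed, sample size, and injected outlier are arbitrary choices for illustration):

set.seed(42)                        # arbitrary seed, for reproducibility only
z <- rnorm(1000, mean = 0, sd = 1)  # true SD is 1
z_bad <- c(z, 500)                  # contaminate with a single gross outlier

sd(z_bad)           # blown up by the one outlier (on the order of 15)
IQR(z_bad) / 1.349  # stays close to the true SD of 1
mad(z_bad)          # median absolute deviation (scaled by default); also close to 1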

Matt Krause
  • Interesting! Thanks for clarifying. – Shawn Hemelstrand Feb 05 '23 at 01:13
  • "even a single outlier is enough to make $\bar{x}$ take on any value" --> upon reflection, this is so obvious. But I've never thought about it this way until you stated it so clearly. Thanks. – Paul H Aug 10 '23 at 15:18
  • Glad I could help! If you're interested in this sort of thing, the keywords to check out are "robust statistics" (generally) and "breakdown point" (essentially how many outliers a method can 'tolerate'). – Matt Krause Aug 10 '23 at 15:46